Software RAID

How to set up and use software RAID on Debian.

some raid examples

Creating RAID devices:

mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sd[ab]8
mdadm --create /dev/md4 --level=5 --raid-devices=3 /dev/sd[abc]8
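
To check that the array came up and to watch the initial sync, /proc/mdstat and the array details are the places to look (using /dev/md4 from the examples above):

cat /proc/mdstat
mdadm --detail /dev/md4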

Stopping RAID:
mdadm --stop /dev/md1

Removing a physical device:

mdadm --fail /dev/md4 /dev/sdc8
mdadm --remove /dev/md4 /dev/sdc8

Adding it back in:
mdadm --add /dev/md4 /dev/sdc8
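
After the add, the rebuild progress shows up in /proc/mdstat; a simple way to keep an eye on it:

watch cat /proc/mdstat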

booting off raid1

Grub works with software RAID1, so long as grub is installed on each disk.

But there is a problem. Let’s say you have sda and sdb. If sdb fails and grub is installed on sda, it should boot fine. However, if sda fails, then sdb typically becomes sda. If grub was installed on sdb while it was mapped to device hd1, it will not be able to boot correctly (since there is no longer an hd1).

The trick is to make grub think that it is installing on (hd0) when it is installing on sdb.

# grub
grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
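
For the first disk, a normal install works; a minimal sketch, assuming sda really is the BIOS’s first disk and /boot (or /) lives on the first partition:

# grub
grub> root (hd0,0)
grub> setup (hd0)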

see also: lists.us.dell.com/pipermail/linux-power...

updating mdadm.conf

If you create a new raid device, Debian will not notice it on boot unless you also update the file /etc/mdadm/mdadm.conf.

cd /etc/mdadm
cp mdadm.conf mdadm.conf.`date +%y%m%d`
echo "DEVICE partitions" > mdadm.conf
echo "MAILADDR root" >> mdadm.conf
echo "CREATE owner=root group=disk mode=0660 auto=yes" >> mdadm.comf
mdadm --detail --scan >> mdadm.conf

Then look at the file to see if it looks right and to make sure there isn’t extra stuff in there.
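
On Debian the initramfs keeps its own copy of mdadm.conf, so if any of these arrays are needed at boot it is probably also worth regenerating it after editing the file:

update-initramfs -u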

Here is an example /etc/mdadm/mdadm.conf:

DEVICE partitions
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=6b8b4567:327b23c6:643c9869:66334873
   devices=/dev/sda1,/dev/sdb1
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=6b8b4567:327b23c6:643c9869:66334873
   devices=/dev/sda5,/dev/sdb5
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=6b8b4567:327b23c6:643c9869:66334873
   devices=/dev/sda6,/dev/sdb6
ARRAY /dev/md3 level=raid1 num-devices=2 UUID=6b8b4567:327b23c6:643c9869:66334873
   devices=/dev/sda7,/dev/sdb7
ARRAY /dev/md4 level=raid5 num-devices=3 UUID=b1bfcde9:088dc404:2b4bed20:2f1c5da5
   devices=/dev/sda8,/dev/sdb8,/dev/sdc8

hot swap scsi

Hot swapping SCSI drives is not directly related to software RAID, but it is a common task when working with RAID.

To remove a hot swap drive:

# echo "scsi remove-single-device 0 0 X 0" > /proc/scsi/scsi

To add a hot swap drive:

 # echo "scsi add-single-device 0 0 X 0" > /proc/scsi/scsi

Where X, starting at 0, is the drive number. The other parameters may differ (the channel number might not be zero, for example), so cat /proc/scsi/scsi first to see what makes sense.
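
The four numbers are host, channel, id and lun, in the same order they appear in /proc/scsi/scsi. Its output looks roughly like this (illustrative entry; your vendor, model and revision lines will differ):

Attached devices:
Host: scsi0 Channel: 00 Id: 02 Lun: 00
  Vendor: SEAGATE  Model: ST373455SS       Rev: 0002
  Type:   Direct-Access                    ANSI SCSI revision: 05

That drive would be removed with "scsi remove-single-device 0 0 2 0".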

hot swap sata

SATA drives in lenny (2.6.26 kernel) behave as SCSI disks, and you still use the above method to remove and add them. But in squeeze (2.6.32) and newer, SATA drives use different interfaces.

IMPORTANT: you need your BIOS set to run the SATA drives in AHCI mode (rather than IDE mode). Without this, hotplug events won’t be properly detected and handled.

To hot remove a drive, make sure it is not in use and then just yank it out. The bus will notice and rescan.

If you need to force remove a hot swap drive or prefer to not cause a bus reset:

# echo 1 > /sys/block/sda/device/delete

To hot add a drive just insert it and the bus will detect it and rescan.

You can also force the bus to rescan with:

# echo "- - -" > /sys/class/scsi_host/host0/scan

when device names change

Sometimes when you upgrade a kernel, the names of the devices which compose the array will change. This prevents the array from being assembled at boot.

You can see this by examining /proc/partitions and comparing it to /etc/mdadm/mdadm.conf.
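
mdadm can also read the RAID superblocks off the disks directly, which shows which partitions belong to which array no matter what the kernel is currently calling them:

/sbin/mdadm --examine --scan
/sbin/mdadm --examine /dev/sda1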

To re-assemble the array, simply run this command for each array:

/sbin/mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1

Hopefully, the root filesystem will still mount even if the other arrays don’t come up. If root is mounted, you will have the /sbin/mdadm command available.

In this case, we are re-assembling md0 from sda1 and sdb1. Once you have assembled the RAID device, mount it to make sure it is right. When you are satisfied, create a new mdadm.conf using the method listed above.

an example recovery scenario

Two RAID1 devices:

md0  sda1 sdb1  /
md1  sda2 sdb2  /var

Let’s suppose disk sda dies!

  1. You take out the dead sda and reboot. Now all the RAID devices are running in degraded mode and the old sdb has become sda, since SCSI names the disks in the order it encounters them. So even though we lost sda, after the reboot the RAID devices show only sda.
  2. Then you hot add a new disk in the first slot. Since there is already an sda, this becomes sdb. So we have the situation where sda and sdb are switched. If we reboot, they will return to their normal order (i.e. the first SCSI slot will be sda).
  3. Partition the new disk:
    If the disks are exactly the same (same size and same geometry) you can use:
    cat /var/backups/partitions.sda.txt | sfdisk /dev/sdb

    You read that right! We use the old sda’s partition table to partition the new sda, which is currently sdb. (The sketch after this list shows how that dump file gets created in the first place.)
    If the disks are the same size but different geometry, you can create the new partitions by hand using fdisk. Most likely the partitions aren’t going to land on an even multiple of cylinders, but you can switch fdisk into sector mode by pressing ‘u’. Then, with some math, you can create partitions that are the same size.
  4. Rebuild the raid:
    mdadm --manage /dev/md0 --add /dev/sdb1
    mdadm --manage /dev/md1 --add /dev/sdb2
    

    So now the disk in the first slot is in the array under the name sdb, and the disk in the second slot under the name sda. It turns out, it doesn’t matter.
  5. Reboot. The devices revert to their normal order, and the RAID devices work fine.
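
About the /var/backups/partitions.sda.txt file used in step 3: it is just an sfdisk dump of the old partition table, and it only helps if it was saved before the disk died. A minimal sketch of creating those dumps ahead of time (the filenames are just a convention, use whatever you like):

sfdisk -d /dev/sda > /var/backups/partitions.sda.txt
sfdisk -d /dev/sdb > /var/backups/partitions.sdb.txt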

rebuilds of large disks

If a server crashes while doing writes to the md disk, after you reboot the machine RAID1 arrays will need to resync. On a large disk this can take a really long time (up to a few hours). In etch and squeeze+ this rebuild can run in the background while the system continues to boot and function normally. But in lenny, there was a bug on some systems that caused the rebuild to consume all I/O on the system and basically make the system unusable until the rebuild finished. This was also triggered by the monthly mdadm cronjob that does a check on the array, which was particularly annoying.
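
If a rebuild or the monthly check is swamping the disks, the kernel’s md resync speed limits (in KiB/s) can be turned down to keep the machine responsive; the exact numbers depend on your hardware:

# cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
# echo 10000 > /proc/sys/dev/raid/speed_limit_max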

There is a feature of mdadm to help make rebuilds faster, called “write intent bitmaps”. The idea is that right before the system is about to write to part of the md array, it writes to a bitmap stating its intent to write to that area, and after it makes the write it clears that intent in the bitmap. So if the server crashes, upon reboot the bitmap can be consulted and only the areas of the array that had active writes need to be rebuilt, rather than the whole thing. There is a problem with doing this: if the bitmap is on the same disk as the array, the drive heads need to write to the bitmap on one part of the disk, then jump to the array to do the write, then jump back to the bitmap. This results in a performance loss, reported to be ~5% or so. But the cool thing is that there is support for external write intent bitmaps, so you can put the bitmap on another disk (which needs to be more idle or less performance critical than the array in order for this to be an advantage).

Here’s a short howto; more info is in the mdadm manpage too.

While rebuilds can run fine in the background when everything is working properly, the rebuild is still a window in which, if the system crashes or a disk dies, you might lose the array. Rebuilds also consume a lot of I/O, which might hurt on a heavily loaded server, so it’s silly to rebuild the entire array if you don’t need to. If RAID rebuild time is important to you, you might consider adding this feature.
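
A minimal sketch of turning the feature on for an existing array, either as an internal bitmap stored on the array itself or as an external bitmap in a file (the file must live on a filesystem that is not on this array; the path below is just an example):

mdadm --grow /dev/md0 --bitmap=internal                  # add an internal bitmap
mdadm --grow /dev/md0 --bitmap=none                      # remove it again
mdadm --grow /dev/md0 --bitmap=/other/disk/md0-bitmap    # external bitmap on another disk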

There is another idea similar to write intent bitmaps, called journal guided resync, but it’s still pretty new.

links

Resizing and encrypted RAID5