RAID structure

_$: for md in /dev/md?; do echo $md ; mdadm --detail $md | grep "/dev/sd"; done
/dev/md0
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
/dev/md1
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
/dev/md2
       0       8        5        0      active sync   /dev/sda5
       1       8       21        1      active sync   /dev/sdb5
/dev/md3
       0       8        6        0      active sync   /dev/sda6
       1       8       22        1      active sync   /dev/sdb6

RAID status

_$: for md in /dev/md? ; do echo $md ; mdadm --detail $md | grep "State :"; done
/dev/md0
          State : clean
/dev/md1
          State : clean
/dev/md2
          State : clean
/dev/md3
          State : clean

We can also use /proc:

_$: cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md3 : active raid1 sda6[2] sdb6[1]
      566473536 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda5[2] sdb5[1]
      97589120 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda3[2] sdb3[1]
      292837184 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[2] sdb1[1]
      3903424 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Replace a failed hard disk

The /dev/sda hard disk has failed; /dev/sdb is still OK.

Initial status

_$: for md in /dev/md?; do state=$(mdadm --detail $md | grep "State : " | cut -f2 -d':'); printf "%s: %s\n" $md $state; done
/dev/md0: clean, degraded
/dev/md1: clean, degraded
/dev/md2: clean, degraded

_$: cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 sdb6[0]
      953193280 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sdb1[0]
      1950656 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sdb5[0]
      19513216 blocks super 1.2 [2/1] [U_]

Mark the partitions of the failed disk as faulty

_$: mdadm --manage /dev/md0 --fail /dev/sda1
_$: mdadm --manage /dev/md1 --fail /dev/sda5
_$: mdadm --manage /dev/md2 --fail /dev/sda6

Remove the partitions from the RAID

_$: mdadm --manage /dev/md0 --remove /dev/sda1
_$: mdadm --manage /dev/md1 --remove /dev/sda5
_$: mdadm --manage /dev/md2 --remove /dev/sda6

We can also achieve the same with just one command:

_$: mdadm --manage /dev/md0 --fail /dev/sda1 --remove /dev/sda1
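
Applied to the other two arrays of this example, the combined form would be:

_$: mdadm --manage /dev/md1 --fail /dev/sda5 --remove /dev/sda5
_$: mdadm --manage /dev/md2 --fail /dev/sda6 --remove /dev/sda6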

Now we power off the computer and replace the failed hard disk.
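
Before powering off, it can help to note down the serial number of the failing disk so we pull the right drive from the bay (this assumes smartctl is installed, as it is used later in these notes):

_$: smartctl -i /dev/sda | grep -i "Serial Number"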

Copy the partition table from the old disk to the new one

_$: sfdisk -d /dev/sdb | sfdisk /dev/sda  # Errors? => Force it
_$: sfdisk -d /dev/sdb | sfdisk --force /dev/sda
_$: sfdisk -l /dev/sda

Disk /dev/sda: 121601 cylinders, 255 heads, 63 sectors/track
Warning: extended partition does not start at a cylinder boundary.
DOS and Linux will interpret the contents differently.
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

   Device Boot Start     End   #cyls    #blocks   Id  System
/dev/sda1          0+    243-    243-   1951744   fd  Linux raid autodetect
/dev/sda2        243+    486-    244-   1952768   82  Linux swap / Solaris
/dev/sda3        486+ 121601- 121115- 972855297    5  Extended
/dev/sda4          0       -       0          0    0  Empty
/dev/sda5        486+   2917-   2432-  19529728   fd  Linux raid autodetect
/dev/sda6       2917+ 121601- 118684- 953324544   fd  Linux raid autodetect
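
Note: sfdisk copies MBR partition tables, which is what this example uses. If the disks were partitioned with GPT instead, a sketch of the equivalent copy with sgdisk (from the gdisk package) would be, with /dev/sdb as the healthy disk and /dev/sda as the new one:

_$: sgdisk -R /dev/sda /dev/sdb   # replicate sdb's partition table onto sda
_$: sgdisk -G /dev/sda            # give the new disk its own random GUIDs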

Add the partitions to the RAID

_$: mdadm --manage /dev/md0 --add /dev/sda1
_$: mdadm --manage /dev/md1 --add /dev/sda5
_$: mdadm --manage /dev/md2 --add /dev/sda6

_$: cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 sda6[2] sdb6[0]
      953193280 blocks super 1.2 [2/1] [U_]
      	resync=DELAYED

md0 : active raid1 sda1[2] sdb1[0]
      1950656 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda5[2] sdb5[0]
      19513216 blocks super 1.2 [2/1] [U_]
      [==>..................]  recovery = 13.0% (2543808/19513216) finish=1.8min speed=149635K/sec
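
While the arrays rebuild, the progress can be followed with:

_$: watch -n 5 cat /proc/mdstat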

Install GRUB in the new hard disk

Check that the GRUB version we are using is GRUB2:

_$: grub-install -v
grub-install (GRUB) 1.99-21ubuntu3.10

If we were using GRUB Legacy (GRUB 1), we would see something like:

_$: grub-install -v
grub-install (GNU GRUB 0.97)

If we are not using GRUB2, the first step is to update to GRUB2:

_$: apt-get update
_$: apt-get purge grub-common
_$: apt-get install grub-pc
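
On Debian/Ubuntu (which the apt-get commands above assume), the grub-pc package asks via debconf which disks GRUB should be installed on; that question can be brought back later with:

_$: dpkg-reconfigure grub-pc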

a) Install GRUB using grub-install:

We will install GRUB in the MBR of the hard disk, not in the /boot partition: the boot code goes in the MBR, while the GRUB files live in /boot (or wherever we point to with the --boot-directory flag).

_$: grub-install /dev/sda
Installation finished. No error reported.
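
It is worth installing GRUB on both members of the mirror so the machine can still boot if either disk fails (the recheck commands below cover both disks as well):

_$: grub-install /dev/sdb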

b) Install GRUB manually (not recommended; only if you are using GRUB Legacy):

_$: grub
grub> find /boot/grub/stage1
root (hd0,1)
grub> root (hd0,1)
grub> setup (hd0)
grub> quit

Check that GRUB has been properly installed

_$: grub-install --recheck /dev/sda
Installation finished. No error reported.
_$: grub-install --recheck /dev/sdb
Installation finished. No error reported.

Force RAID synchronization

_$: echo 'check' > /sys/block/md1/md/sync_action
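
The check runs in the background: its progress appears in /proc/mdstat, and once it finishes the mismatch counter can be read from sysfs (writing 'repair' instead of 'check' also rewrites any inconsistent blocks):

_$: cat /proc/mdstat
_$: cat /sys/block/md1/md/mismatch_cnt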

Repair a RAID after a power outage

After a power outage, the RAID may be left in a degraded state.

_$: cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1[2] sdb1[3]
      1950656 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda5[2]
      19513216 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sda6[2]
      953193280 blocks super 1.2 [2/1] [_U]

unused devices: <none>

Check that the file system is not in read-only mode

_$: touch a
touch: cannot touch `a': Read-only file system

If it is, reboot the computer:

_$: reboot
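
Alternatively, before resorting to a reboot, it is sometimes enough to remount the root file system read-write (this may fail if the kernel forced it read-only after an error):

_$: mount -o remount,rw /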

Rebuild the RAID

_$: mdadm --manage /dev/md1 --add /dev/sdb5
mdadm: added /dev/sdb5
_$: mdadm --manage /dev/md2 --add /dev/sdb6
mdadm: added /dev/sdb6
_$: cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1[2] sdb1[3]
      1950656 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb5[3] sda5[2]
      19513216 blocks super 1.2 [2/1] [_U]
      [=>...................]  recovery =  8.5% (1665408/19513216) finish=2.3min speed=128108K/sec

md2 : active raid1 sdb6[3] sda6[2]
      953193280 blocks super 1.2 [2/1] [_U]
      	resync=DELAYED

unused devices: <none>

Run a long SMART test on the failed hard disk

_$: smartctl -t long  /dev/sdb
_$: smartctl -a /dev/sdb | grep -A 1 "Self-test execution status"
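
The long test runs in the background and can take hours; once "Self-test execution status" reports completion, the results can be read with:

_$: smartctl -l selftest /dev/sdb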

Repair the RAID when a hard disk has failed (initramfs)

_$: mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set device faulty failed for /dev/sda1:  No such device
_$: mdadm --manage /dev/md0 --remove /dev/sda1
mdadm: hot remove failed for /dev/sda1: No such device or address

This means that the /dev/sda hard disk has already been removed from the RAID, so we just need to add it back:

_$: mdadm --manage /dev/md0 --add /dev/sda1
mdadm: added /dev/sda1

And do the same for the rest of the partitions: /dev/sda3, /dev/sda5, etc.
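
For example, assuming /dev/sda3 belongs to /dev/md1 and /dev/sda5 to /dev/md2 (check the actual layout with mdadm --detail or /proc/mdstat first), the remaining commands would be:

_$: mdadm --manage /dev/md1 --add /dev/sda3
_$: mdadm --manage /dev/md2 --add /dev/sda5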

Email

Check that there is an mdadm script in /etc/cron.daily.

/etc/cron.daily/mdadm:
----------------------
#!/bin/sh
#
# cron.daily/mdadm -- daily check that MD devices are functional
#
# Copyright © 2008 Paul Slootman <paul@debian.org>
# distributed under the terms of the Artistic Licence 2.0

# As recommended by the manpage, run
#      mdadm --monitor --scan --oneshot
# every day to ensure that any degraded MD devices don't go unnoticed.
# Email will go to the address specified in /etc/mdadm/mdadm.conf .
#
set -eu

MDADM=/sbin/mdadm
[ -x $MDADM ] || exit 0 # package may be removed but not purged

exec $MDADM --monitor --scan --oneshot

If it is there, a RAID check will run every day and an email will be sent if any problems are found. But we still need to configure the destination address:

/etc/mdadm/mdadm.conf:
----------------------
...
# instruct the monitoring daemon where to send mail alerts
MAILADDR <user>@example.com
...

Finally, check in /etc/crontab the time at which the scripts in /etc/cron.daily are run; the computer should be powered on at that time.
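
To verify that the alerts actually reach the mailbox, mdadm can send a test message for every array:

_$: mdadm --monitor --scan --oneshot --test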

Repair a hard disk in spare state

_$: mdadm --detail /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Wed Sep 25 17:57:16 2013
     Raid Level : raid1
     Array Size : 566473536 (540.23 GiB 580.07 GB)
  Used Dev Size : 566473536 (540.23 GiB 580.07 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Tue Mar  3 10:57:49 2015
          State : clean, degraded
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

           Name : user-nix:3  (local to host user-nix)
           UUID : b9ac204b:cda03d57:5b3304d8:1fbfa57f
         Events : 16916

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       22        1      active sync   /dev/sdb6

       2       8        6        -      spare   /dev/sda6

_$: mdadm --re-add /dev/md3 /dev/sda6
Cannot open /dev/sda6: Device or resource busy

_$: mdadm --manage /dev/md3 --fail /dev/sda6 --remove /dev/sda6
_$: mdadm --manage /dev/md3 --add  /dev/sda6
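
After the add, the rebuild should start; it can be confirmed with the usual commands:

_$: cat /proc/mdstat
_$: mdadm --detail /dev/md3 | grep "State :"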