Replacing a failed disk in a software mirror
Latest revision as of 04:24, 19 September 2018
== Solaris ==
So here's the scenario: you have a server with two disks, c0t0d0 and c0t1d0, mirrored with Solaris DiskSuite. The disk c0t1d0 has failed and you want to replace it without shutting down the box. We have the following metadevices and sub-mirrors:
- d0: d1 and d2
- d10: d11 and d12
- d20: d21 and d22
- d30: d31 and d32
- d40: d41 and d42
- d50: d51 and d52
This procedure has been tested and works on a SunFire V125 running Solaris 10. You may have to alter it slightly depending on the hardware you are running. Consult the User Guide for your server on how best to replace a drive on a running system.
1. Delete the meta databases on the failed disk; in this case they are stored on slice 7
    metadb -d c0t1d0s7
2. Detach the failed disk's sub-mirrors from their metadevices
    metadetach -f d0 d2
    metadetach -f d10 d12
    metadetach -f d20 d22
    metadetach -f d30 d32
    metadetach -f d40 d42
    metadetach -f d50 d52
The -f option is necessary because the disk has failed, so the detach has to be forced.
3. Clear the metadevices that were associated with the failed disk
    metaclear d2
    metaclear d12
    metaclear d22
    metaclear d32
    metaclear d42
    metaclear d52
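Steps 2 and 3 run the same two commands once per mirror, so they lend themselves to a loop. The following is only a sketch: `print_detach_cmds` is a made-up helper name, and it prints the commands for review rather than running them. It takes mirror/sub-mirror pairs as arguments and handles one mirror at a time (detach, then clear), which should be equivalent to the order used above because each metaclear only touches the sub-mirror that was just detached.

```shell
# Print (do not run) the metadetach/metaclear commands for a list of
# "mirror sub-mirror" pairs. print_detach_cmds is a hypothetical helper
# name used for illustration only.
print_detach_cmds() {
    while [ $# -ge 2 ]; do
        echo "metadetach -f $1 $2"   # $1 = mirror, $2 = sub-mirror on the failed disk
        echo "metaclear $2"
        shift 2
    done
}

# Pairs taken from the metadevice list in this article.
print_detach_cmds d0 d2 d10 d12 d20 d22 d30 d32 d40 d42 d50 d52
```

Once the printed commands look right, pipe the output to `sh` or paste it into a root shell.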
4. Find the correct Ap_Id for the failed disk, with the cfgadm command
    cfgadm -al
    Ap_Id                Type         Receptacle   Occupant     Condition
    c0                   scsi-bus     connected    configured   unknown
    c0::dsk/c0t0d0       disk         connected    configured   unknown
    c0::dsk/c0t1d0       disk         connected    configured   unknown
    c1                   scsi-bus     connected    unconfigured unknown
    usb0/1               unknown      empty        unconfigured ok
    usb0/2               unknown      empty        unconfigured ok
5. Unconfigure the device so that you can remove it
    cfgadm -c unconfigure c0::dsk/c0t1d0
6. Check that the device is now unconfigured
    cfgadm -al
    Ap_Id                Type         Receptacle   Occupant     Condition
    c0                   scsi-bus     connected    configured   unknown
    c0::dsk/c0t0d0       disk         connected    configured   unknown
    c0::dsk/c0t1d0       disk         connected    unconfigured unknown
    c1                   scsi-bus     connected    unconfigured unknown
    usb0/1               unknown      empty        unconfigured ok
    usb0/2               unknown      empty        unconfigured ok
NOTE: You may get this error when you try to unconfigure the drive in step 5:
    # cfgadm -c unconfigure c0::dsk/c0t1d0
    cfgadm: Component system is busy, try again: failed to offline:
         Resource              Information
    ------------------  -----------------------
    /dev/dsk/c0t1d0s1   dump device (dedicated)
This means that the raw slice on the failed disk, rather than the mirror metadevice, has been set as the dump device. You can confirm it with:
    # dumpadm
          Dump content: kernel pages
           Dump device: /dev/dsk/c0t1d0s1 (dedicated)
    Savecore directory: /var/crash/hostname
      Savecore enabled: yes
       Save compressed: on
You fix it by pointing the dump device at the mirror instead:
    # dumpadm -d /dev/md/dsk/d10
          Dump content: kernel pages
           Dump device: /dev/md/dsk/d10 (swap)
    Savecore directory: /var/crash/hostname
      Savecore enabled: yes
       Save compressed: on
Now you should be able to unconfigure the drive.
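If you would rather catch this before cfgadm complains, you can check the dumpadm output up front. This is a minimal sketch, not part of the original procedure: `dump_on_disk` is a hypothetical helper that reads dumpadm output on stdin and succeeds when the dump device is a raw slice of the named disk. The sample input is copied from the output shown in this article.

```shell
# Succeeds if the "Dump device:" line points at a raw slice of disk $1
# (for example c0t1d0). dump_on_disk is a hypothetical name.
dump_on_disk() {
    grep "Dump device: */dev/dsk/${1}s" >/dev/null
}

# Demo using the dumpadm output shown above; on a live system you would
# run:  dumpadm | dump_on_disk c0t1d0
dump_on_disk c0t1d0 <<'EOF' && echo "dump device is on c0t1d0, repoint it with dumpadm -d"
      Dump content: kernel pages
       Dump device: /dev/dsk/c0t1d0s1 (dedicated)
Savecore directory: /var/crash/hostname
EOF
```

The check only looks at /dev/dsk paths, so a dump device that already points at /dev/md/dsk/d10 does not trigger it.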
7. Physically replace the failed drive
8. Use cfgadm to see if the OS has automatically configured the drive
    cfgadm -al
If the status of the drive hasn't changed to 'configured', change it manually
    cfgadm -c configure c0::dsk/c0t1d0
9. Use the format command to check that the OS can see the drive.
    format
    Searching for disks...done
    AVAILABLE DISK SELECTIONS:
        0. c0t0d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
           /pci@1c,600000/scsi@2/sd@0,0
        1. c0t1d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
           /pci@1c,600000/scsi@2/sd@1,0
    Specify disk (enter its number):
10. Copy the primary disk's VTOC to the secondary disk:
    prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2
11. Re-create the meta databases on the new disk
    metadb -f -a -c3 /dev/dsk/c0t1d0s7
12. Re-create the metadevices on the new disk
    metainit d2 1 1 c0t1d0s0
    metainit d12 1 1 c0t1d0s1
    metainit d22 1 1 c0t1d0s3
    metainit d32 1 1 c0t1d0s4
    metainit d42 1 1 c0t1d0s5
    metainit d52 1 1 c0t1d0s6
13. Attach the metadevices from the new disk to the primary metadevices
    metattach d0 d2
    metattach d10 d12
    metattach d20 d22
    metattach d30 d32
    metattach d40 d42
    metattach d50 d52
You can monitor the progress of mirroring with this command:
    metastat | grep -i progress
14. Install the boot block on the new hard disk so that you can boot off it
    installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t1d0s0
NOTE: If you get this error with the installboot command
    dd: unrecognized operand `oseek'=`1'
    Try `dd --help' for more information.
then installboot is picking up the wrong version of 'dd'. Change your PATH so that /usr/bin comes first and /usr/bin/dd is the one that gets used.
== CentOS/RedHat ==
So you have a failed disk in your software RAID. When you look at /proc/mdstat it looks something like:
    # cat /proc/mdstat
    Personalities : [raid1]
    md1 : active raid1 sdb1[1] sda1[2](F)
          104320 blocks [2/1] [_U]

    md4 : active raid1 sdb2[1] sda2[2](F)
          6289344 blocks [2/1] [_U]

    md5 : active raid1 sdb5[1] sda5[2](F)
          4192832 blocks [2/1] [_U]

    md2 : active raid1 sdb6[1] sda6[0]
          2096384 blocks [2/2] [UU]

    md3 : active raid1 sdb7[1] sda7[2](F)
          16570944 blocks [2/1] [_U]

    md0 : active raid1 sdb3[1] sda3[2](F)
          6289344 blocks [2/1] [_U]

    unused devices: <none>
This indicates that /dev/sda has failed. (md2 still shows sda6 as healthy, but it lives on the same physical disk and has to come out with the rest.)
To replace it, do the following:
1. Use mdadm to fail and remove each of the slices associated with the failed disk
    mdadm --manage /dev/md1 --fail /dev/sda1
    mdadm --manage /dev/md1 --remove /dev/sda1
/proc/mdstat should look something like this:
    # cat /proc/mdstat
    Personalities : [raid1]
    md1 : active raid1 sdb1[1]
          104320 blocks [2/1] [_U]

    md4 : active raid1 sdb2[1] sda2[2](F)
          6289344 blocks [2/1] [_U]

    md5 : active raid1 sdb5[1] sda5[2](F)
          4192832 blocks [2/1] [_U]

    md2 : active raid1 sdb6[1] sda6[0]
          2096384 blocks [2/2] [UU]

    md3 : active raid1 sdb7[1] sda7[2](F)
          16570944 blocks [2/1] [_U]

    md0 : active raid1 sdb3[1] sda3[2](F)
          6289344 blocks [2/1] [_U]

    unused devices: <none>
Repeat for each slice of the failed disk, including /dev/sda6 in md2, which has to be marked as failed before it can be removed.
/proc/mdstat should now look like this:
    # cat /proc/mdstat
    Personalities : [raid1]
    md1 : active raid1 sdb1[1]
          104320 blocks [2/1] [_U]

    md4 : active raid1 sdb2[1]
          6289344 blocks [2/1] [_U]

    md5 : active raid1 sdb5[1]
          4192832 blocks [2/1] [_U]

    md2 : active raid1 sdb6[1]
          2096384 blocks [2/1] [_U]

    md3 : active raid1 sdb7[1]
          16570944 blocks [2/1] [_U]

    md0 : active raid1 sdb3[1]
          6289344 blocks [2/1] [_U]

    unused devices: <none>
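With six arrays, typing the fail/remove pair for every slice in step 1 gets tedious. As a rough sketch, the commands can be derived from /proc/mdstat itself. `print_fail_remove` is a hypothetical helper, it assumes simple disk names such as sda with numbered partitions, and it only prints the commands so that you can check them before running anything.

```shell
# Read /proc/mdstat-style text on stdin and print the mdadm commands
# needed to fail and remove every partition of disk $1 (e.g. sda).
# print_fail_remove is a made-up name for this example.
print_fail_remove() {
    awk -v disk="$1" '
        $2 == ":" && $3 == "active" {
            # Fields 5..NF are members such as sda1[2](F) or sdb1[1].
            for (i = 5; i <= NF; i++) {
                part = $i
                sub(/\[.*/, "", part)               # sda1[2](F) -> sda1
                if (part ~ ("^" disk "[0-9]+$")) {
                    printf "mdadm --manage /dev/%s --fail /dev/%s\n", $1, part
                    printf "mdadm --manage /dev/%s --remove /dev/%s\n", $1, part
                }
            }
        }
    '
}

# Demo on two of the arrays shown above; on the real system you would
# run:  print_fail_remove sda < /proc/mdstat
print_fail_remove sda <<'EOF'
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[2](F)
      104320 blocks [2/1] [_U]

md2 : active raid1 sdb6[1] sda6[0]
      2096384 blocks [2/2] [UU]

unused devices: <none>
EOF
```

The helper emits --fail even for slices already marked (F), which matches what the manual step does for /dev/sda1.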
2. Shut down the server and replace the failed disk
    shutdown -h now
3. Start up the server and boot off the good disk.
4. Copy the partition table from the good disk to the replacement disk
    sfdisk -d /dev/sdb | sfdisk /dev/sda
5. To make sure that there are no remains from previous RAID installations on /dev/sda, we run the following commands:
    mdadm --zero-superblock /dev/sda1
    mdadm --zero-superblock /dev/sda2
    mdadm --zero-superblock /dev/sda3
    mdadm --zero-superblock /dev/sda5
    mdadm --zero-superblock /dev/sda6
    mdadm --zero-superblock /dev/sda7
If there was nothing to zero, you'll get this response:
    mdadm: Unrecognised md component device - /dev/sda1
All good!
6. Add the new disk's partitions to the md arrays
    mdadm --add /dev/md0 /dev/sda3
    mdadm --add /dev/md1 /dev/sda1
    mdadm --add /dev/md2 /dev/sda6
    mdadm --add /dev/md3 /dev/sda7
    mdadm --add /dev/md4 /dev/sda2
    mdadm --add /dev/md5 /dev/sda5
When it's finished synchronising, /proc/mdstat should look something like:
    # cat /proc/mdstat
    Personalities : [raid1]
    md1 : active raid1 sda1[0] sdb1[1]
          104320 blocks [2/2] [UU]

    md4 : active raid1 sda2[0] sdb2[1]
          6289344 blocks [2/2] [UU]

    md5 : active raid1 sda5[0] sdb5[1]
          4192832 blocks [2/2] [UU]

    md2 : active raid1 sda6[0] sdb6[1]
          2096384 blocks [2/2] [UU]

    md3 : active raid1 sda7[0] sdb7[1]
          16570944 blocks [2/2] [UU]

    md0 : active raid1 sda3[0] sdb3[1]
          6289344 blocks [2/2] [UU]

    unused devices: <none>
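Rather than eyeballing /proc/mdstat until every array shows [UU], you can have a script list the arrays that are still degraded. This is a sketch under the assumption that the status line looks like the ones in this article (the word "blocks" with the [UU]/[_U] flags as the last field). `degraded_arrays` is a hypothetical helper name.

```shell
# Read /proc/mdstat-style text on stdin and print the name of every
# array whose status flags still contain "_" (a missing or rebuilding
# member). degraded_arrays is a made-up name for this example.
degraded_arrays() {
    awk '
        $2 == ":" && $1 ~ /^md/ { name = $1 }      # remember the current array
        /blocks/ && $NF ~ /_/   { print name }     # e.g. "... [2/1] [_U]"
    '
}

# Demo: md1 is in sync, md4 is still rebuilding.
degraded_arrays <<'EOF'
md1 : active raid1 sda1[0] sdb1[1]
      104320 blocks [2/2] [UU]

md4 : active raid1 sda2[2] sdb2[1]
      6289344 blocks [2/1] [_U]
EOF
# prints: md4

# On the real system, a wait loop could look like this:
#   while [ -n "$(degraded_arrays < /proc/mdstat)" ]; do sleep 60; done
```

Only move on to installing grub once the function prints nothing.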
7. Now install grub on the master boot record (MBR) of the new disk
    grub
    Probing devices to guess BIOS drives. This may take a long time.

        GNU GRUB  version 0.97  (640K lower / 3072K upper memory)

     [ Minimal BASH-like line editing is supported.  For the first word, TAB
       lists possible command completions.  Anywhere else TAB lists the possible
       completions of a device/filename.]
    grub> root (hd0,0)
    root (hd0,0)
     Filesystem type is ext2fs, partition type 0x83
    grub> setup (hd0)
    setup (hd0)
     Checking if "/boot/grub/stage1" exists... no
     Checking if "/grub/stage1" exists... yes
     Checking if "/grub/stage2" exists... yes
     Checking if "/grub/e2fs_stage1_5" exists... yes
     Running "embed /grub/e2fs_stage1_5 (hd0)"...  15 sectors are embedded.
    succeeded
     Running "install /grub/stage1 (hd0) (hd0)1+15 p (hd0,0)/grub/stage2 /grub/grub.conf"... succeeded
    Done.
    grub> quit
Now you should be mirrored and good to go :-)