Replacing a failed disk in a software mirror

From Peter Pap's Technowiki
Revision as of 03:15, 13 January 2011 by Ppapa (talk | contribs)


Solaris

Here is the scenario: you have a server whose two disks, c0t0d0 and c0t1d0, are mirrored with Solaris DiskSuite. The disk c0t1d0 has failed and you want to replace it without shutting down the box. We have the following mirror metadevices and their sub-mirrors; the second sub-mirror of each pair (d2, d12 and so on) lives on the failed disk:

  • d0: d1 and d2
  • d10: d11 and d12
  • d20: d21 and d22
  • d30: d31 and d32
  • d40: d41 and d42
  • d50: d51 and d52

This procedure has been tested and works on a SunFire V125 running Solaris 10. You may have to alter it slightly depending on the hardware you are running. Consult the User Guide for your server on how best to replace a drive on a running system.

1. Delete the meta databases stored on the failed disk; in this case they are on slice 7

 metadb -d c0t1d0s7

2. Detach the sub-mirrors that live on the failed disk from their meta devices

 metadetach -f d0 d2
 metadetach -f d10 d12
 metadetach -f d20 d22
 metadetach -f d30 d32
 metadetach -f d40 d42
 metadetach -f d50 d52

The -f option is needed because the disk has failed, so the detach has to be forced.

3. Clear the meta devices that were associated with the failed disk

 metaclear d2
 metaclear d12
 metaclear d22
 metaclear d32
 metaclear d42
 metaclear d52
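Steps 2 and 3 repeat the same two commands for every mirror, so they can be scripted. What follows is a minimal sketch, assuming the mirror and sub-mirror names listed at the top of this page. The RUN variable and the detach_and_clear function are names made up for this example, not part of DiskSuite. RUN defaults to echo, so the script only prints the commands for review; unlike the steps above, it clears each sub-mirror straight after detaching it.

```shell
# Dry-run sketch: RUN and detach_and_clear are hypothetical names.
# RUN defaults to echo, so the commands are printed, not executed.
RUN=${RUN:-echo}

detach_and_clear() {
    # Arguments: mirror/sub-mirror pairs; the sub-mirror is the half
    # that lives on the failed disk.
    while [ $# -ge 2 ]; do
        $RUN metadetach -f "$1" "$2" || return 1
        $RUN metaclear "$2" || return 1
        shift 2
    done
}

detach_and_clear d0 d2 d10 d12 d20 d22 d30 d32 d40 d42 d50 d52
```

Once the printed commands look right, running the script with RUN set to an empty string (for example RUN= sh script.sh, where script.sh is whatever you named the file) would execute them for real.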

4. Find the Ap_Id of the failed disk with the cfgadm command

 cfgadm -al
 Ap_Id                          Type         Receptacle   Occupant     Condition
 c0                             scsi-bus     connected    configured   unknown
 c0::dsk/c0t0d0                 disk         connected    configured   unknown
 c0::dsk/c0t1d0                 disk         connected    configured   unknown
 c1                             scsi-bus     connected    unconfigured unknown
 usb0/1                         unknown      empty        unconfigured ok
 usb0/2                         unknown      empty        unconfigured ok

5. Unconfigure the device so that you can remove it

 cfgadm -c unconfigure c0::dsk/c0t1d0

6. Check that the device is now unconfigured

 cfgadm -al
 Ap_Id                          Type         Receptacle   Occupant     Condition
 c0                             scsi-bus     connected    configured   unknown
 c0::dsk/c0t0d0                 disk         connected    configured   unknown
 c0::dsk/c0t1d0                 disk         connected    unconfigured unknown
 c1                             scsi-bus     connected    unconfigured unknown
 usb0/1                         unknown      empty        unconfigured ok
 usb0/2                         unknown      empty        unconfigured ok
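If you would rather not read the Occupant column by eye, the check in step 6 can be scripted. This is only a sketch: occupant_state is a helper invented for this example, and it assumes the five-column layout shown in the listing above.

```shell
# occupant_state is a hypothetical helper, not a cfgadm feature. It reads
# `cfgadm -al` output on stdin and prints the Occupant column for one Ap_Id.
occupant_state() {
    while read ap_id type receptacle occupant condition; do
        if [ "$ap_id" = "$1" ]; then
            echo "$occupant"
            return 0
        fi
    done
    return 1    # Ap_Id not found in the listing
}

# Demonstration on two lines copied from the listing in step 6.
# On the server itself: cfgadm -al | occupant_state c0::dsk/c0t1d0
occupant_state c0::dsk/c0t1d0 <<'EOF'
c0::dsk/c0t0d0                 disk         connected    configured   unknown
c0::dsk/c0t1d0                 disk         connected    unconfigured unknown
EOF
```

The same helper can be reused in step 8 to decide whether the manual cfgadm -c configure is needed.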

7. Physically replace the failed drive

8. Use cfgadm to see if the OS has automatically configured the drive

 cfgadm -al

If the status of the drive hasn't changed to 'configured', change it manually

 cfgadm -c configure c0::dsk/c0t1d0

9. Use the format command to check that the OS can see the drive.

 format
 Searching for disks...done
 
 
 AVAILABLE DISK SELECTIONS:
        0. c0t0d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
           /pci@1c,600000/scsi@2/sd@0,0
        1. c0t1d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
           /pci@1c,600000/scsi@2/sd@1,0
 Specify disk (enter its number):

10. Copy the primary disk's VTOC to the secondary disk:

 prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2
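Before building anything on the new disk it can be worth confirming that the copy took. The sketch below compares two saved prtvtoc listings. The function names vtoc_layout and same_vtoc, and the /tmp file names, are invented for this example. It assumes the usual prtvtoc layout: comment lines start with "*", and the partition number, first sector and sector count are columns 1, 4 and 5.

```shell
# vtoc_layout and same_vtoc are hypothetical helpers for this sketch.
vtoc_layout() {
    # Drop the comment lines and keep partition number, first sector
    # and sector count, so mount points and headers do not matter.
    grep -v '^\*' "$1" | awk '{ print $1, $4, $5 }'
}

same_vtoc() {
    # Succeeds (exit status 0) when both listings describe the same layout.
    vtoc_layout "$1" > "/tmp/vtoc_a.$$"
    vtoc_layout "$2" > "/tmp/vtoc_b.$$"
    cmp -s "/tmp/vtoc_a.$$" "/tmp/vtoc_b.$$"
    rc=$?
    rm -f "/tmp/vtoc_a.$$" "/tmp/vtoc_b.$$"
    return $rc
}

# Intended use on the server (not run here):
#   prtvtoc /dev/rdsk/c0t0d0s2 > /tmp/primary.vtoc
#   prtvtoc /dev/rdsk/c0t1d0s2 > /tmp/secondary.vtoc
#   same_vtoc /tmp/primary.vtoc /tmp/secondary.vtoc && echo "VTOCs match"
```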

11. Re-create the meta databases on the new disk

 metadb -f -a -c3 /dev/dsk/c0t1d0s7

12. Re-create the meta devices on the new disk

 metainit d2 1 1 c0t1d0s0
 metainit d12 1 1 c0t1d0s1
 metainit d22 1 1 c0t1d0s3
 metainit d32 1 1 c0t1d0s4
 metainit d42 1 1 c0t1d0s5
 metainit d52 1 1 c0t1d0s6

13. Attach the new sub-mirrors on the new disk to their primary meta devices

 metattach d0 d2
 metattach d10 d12
 metattach d20 d22
 metattach d30 d32
 metattach d40 d42
 metattach d50 d52
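Steps 12 and 13 follow a fixed pattern (mirror, sub-mirror, slice), so they can also be generated from a small table. This is a sketch, assuming the slice numbers used in step 12. RUN, NEW_DISK and rebuild_submirrors are names made up for this example. As before, RUN defaults to echo so nothing is changed until you have reviewed the output.

```shell
# Dry-run sketch: RUN, NEW_DISK and rebuild_submirrors are hypothetical names.
RUN=${RUN:-echo}
NEW_DISK=c0t1d0

rebuild_submirrors() {
    # Arguments: triples of mirror, sub-mirror and slice number.
    while [ $# -ge 3 ]; do
        $RUN metainit "$2" 1 1 "${NEW_DISK}s$3" || return 1
        $RUN metattach "$1" "$2" || return 1
        shift 3
    done
}

rebuild_submirrors d0 d2 0  d10 d12 1  d20 d22 3  d30 d32 4  d40 d42 5  d50 d52 6
```

Note that this version attaches each sub-mirror as soon as it is created, rather than creating all of them first as the numbered steps do.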

You can monitor the progress of mirroring with this command:

 metastat | grep -i progress
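To avoid re-running that command by hand, the check can be wrapped in a polling loop. This is a rough sketch: wait_for_resync and INTERVAL are names invented here, and it assumes that metastat prints a line containing "progress" for as long as any sub-mirror is still resyncing, which is what the grep above relies on.

```shell
# wait_for_resync is a hypothetical helper; INTERVAL (seconds between
# checks) is an assumed variable, defaulting to 60.
INTERVAL=${INTERVAL:-60}

wait_for_resync() {
    # grep prints the matching lines, so each pass shows current progress.
    # The loop ends on the first pass with no "progress" line.
    while metastat | grep -i progress; do
        sleep "$INTERVAL"
    done
    echo "no resync in progress"
}

# On the server, start it with: wait_for_resync
```

It is probably best to let the resync finish before relying on the new disk, for example before testing a boot from it.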

14. Install the boot block on the new hard disk so that you can boot off it

 installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t1d0s0

NOTE: If you get this error with the installboot command

 dd: unrecognized operand `oseek'=`1' Try `dd --help' for more information.

then installboot is picking up the wrong version of 'dd' (the `dd --help' hint suggests it is most likely GNU dd). Put /usr/bin at the front of your PATH so that /usr/bin/dd is found first.
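One way to do that for the current shell session only, shown as a sketch; it assumes nothing beyond the standard dd living in /usr/bin, as the note above says.

```shell
# Put /usr/bin at the front of PATH for this shell session only.
PATH=/usr/bin:$PATH
export PATH

# Show which dd the shell will now run; on the Solaris box this
# should report /usr/bin/dd.
type dd
```

After that, run the installboot command from step 14 again in the same shell.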


CentOS/RedHat Linux