Replacing a failed disk in a software mirror

From Peter Pap's Technowiki
Revision as of 23:48, 14 November 2010 by Ppapa (talk | contribs)

Here's the scenario: you have a server with two disks, c0t0d0 and c0t1d0, mirrored with Solaris DiskSuite. The disk c0t1d0 has failed and you want to replace it without shutting down the box. We have the following metadevices and submirrors:

 d0:  d1 and d2
 d10: d11 and d12
 d20: d21 and d22
 d30: d31 and d32
 d40: d41 and d42
 d50: d51 and d52

This procedure has been tested and works on a Sun Fire V120. You may have to alter it slightly depending on the hardware you are running. Consult the User Guide for your server on how best to replace a drive on a running system.

1. Delete the meta databases (state database replicas) on the failed disk, which in this case are stored on slice 7

 metadb -d c0t1d0s7

2. Detach the submirrors that live on the failed disk from their metadevices

 metadetach -f d0 d2
 metadetach -f d10 d12
 metadetach -f d20 d22
 metadetach -f d30 d32
 metadetach -f d40 d42
 metadetach -f d50 d52

The -f option is necessary because the disk has failed, so the detach has to be forced.

3. Clear the metadevices that were associated with the failed disk

 metaclear d2
 metaclear d12
 metaclear d22
 metaclear d32
 metaclear d42
 metaclear d52
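
With six mirrors, typing the detach and clear commands by hand is easy to get wrong. The sketch below prints the same commands as steps 2 and 3 from one list of mirror:submirror pairs, so they can be reviewed before being run. It is a dry run only: nothing is executed. The function name print_detach_plan and the PAIRS variable are made up for illustration; the pairs themselves come from the layout listed at the top of this page.

```shell
#!/bin/sh
# Mirror:submirror pairs for the failed disk (from the layout on this page).
PAIRS="d0:d2 d10:d12 d20:d22 d30:d32 d40:d42 d50:d52"

# print_detach_plan: hypothetical helper that only echoes commands.
print_detach_plan() {
    # Step 2: force-detach every submirror that lives on the failed disk.
    for pair in $PAIRS; do
        mirror=${pair%%:*}      # text before the colon, e.g. d0
        submirror=${pair##*:}   # text after the colon, e.g. d2
        echo "metadetach -f $mirror $submirror"
    done
    # Step 3: clear each detached submirror.
    for pair in $PAIRS; do
        echo "metaclear ${pair##*:}"
    done
}
```

Calling print_detach_plan writes twelve lines, the six metadetach commands followed by the six metaclear commands. Once the list looks right, it could be piped to sh, but that is a choice to make with care on a production box.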

4. Find the correct Ap_Id for the failed disk, with the cfgadm command

 cfgadm -al
 Ap_Id                          Type         Receptacle   Occupant     Condition
 c0                             scsi-bus     connected    configured   unknown
 c0::dsk/c0t0d0                 disk         connected    configured   unknown
 c0::dsk/c0t1d0                 disk         connected    configured   unknown
 c1                             scsi-bus     connected    unconfigured unknown
 usb0/1                         unknown      empty        unconfigured ok
 usb0/2                         unknown      empty        unconfigured ok
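
On a box with many attachment points, the Ap_Id can be picked out of the listing with a small filter rather than by eye. This is a minimal sketch that assumes the column layout shown above, with the Ap_Id in the first field. The function name find_ap_id is hypothetical. The -v option needs a POSIX awk, which on Solaris probably means nawk or /usr/xpg4/bin/awk rather than the default awk.

```shell
#!/bin/sh
# find_ap_id DISK: read `cfgadm -al` output on stdin and print the Ap_Id
# that ends in ::dsk/DISK. Hypothetical helper, not part of cfgadm.
find_ap_id() {
    awk -v disk="$1" '
        # Field 1 is the Ap_Id column; split it on "/" and compare the tail.
        {
            n = split($1, parts, "/")
            if (n == 2 && parts[1] ~ /::dsk$/ && parts[2] == disk) print $1
        }'
}
```

Usage would look like `cfgadm -al | find_ap_id c0t1d0`. If the disk is not in the listing, nothing is printed.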

5. Unconfigure the device so that you can remove it

 cfgadm -c unconfigure c0::dsk/c0t1d0

6. Check that the device is now unconfigured

 cfgadm -al
 Ap_Id                          Type         Receptacle   Occupant     Condition
 c0                             scsi-bus     connected    configured   unknown
 c0::dsk/c0t0d0                 disk         connected    configured   unknown
 c0::dsk/c0t1d0                 disk         connected    unconfigured unknown
 c1                             scsi-bus     connected    unconfigured unknown
 usb0/1                         unknown      empty        unconfigured ok
 usb0/2                         unknown      empty        unconfigured ok

7. Physically replace the failed drive

8. Use cfgadm to see if the OS has automatically configured the drive

 cfgadm -al

If the status of the drive hasn't changed to 'configured', change it manually

 cfgadm -c configure c0::dsk/c0t1d0
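
The check in this step can be sketched as a filter over the cfgadm listing: look at the Occupant column for the drive and print the configure command only when it is needed. This assumes the five-column layout shown earlier, with Occupant in the fourth field. The function name configure_cmd is hypothetical, and the sketch only prints the command rather than running it. As with any awk -v usage, a POSIX awk (nawk or /usr/xpg4/bin/awk on Solaris) is assumed.

```shell
#!/bin/sh
# configure_cmd AP_ID: read `cfgadm -al` output on stdin. If the Occupant
# column (field 4) for AP_ID is not "configured", print the cfgadm command
# that would configure it; otherwise print nothing. Hypothetical helper.
configure_cmd() {
    awk -v ap="$1" '
        $1 == ap && $4 != "configured" { print "cfgadm -c configure " ap }'
}
```

For example, `cfgadm -al | configure_cmd c0::dsk/c0t1d0` prints the configure command if the drive is still unconfigured. Empty output means either the drive is already configured or the Ap_Id was not found in the listing, so it is worth looking at the full listing if the result is a surprise.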

9. Use the format command to check that the OS can see the drive.

 format
 Searching for disks...done
 
 
 AVAILABLE DISK SELECTIONS:
        0. c0t0d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
           /pci@1c,600000/scsi@2/sd@0,0
        1. c0t1d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
           /pci@1c,600000/scsi@2/sd@1,0
 Specify disk (enter its number):
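
If you would rather not read the list by eye, the same check can be sketched as a filter over the format output. This assumes entries are printed as in the listing above, with a number, a dot, and then the disk name. The function name disk_listed is hypothetical, and a POSIX awk is assumed (nawk or /usr/xpg4/bin/awk on Solaris).

```shell
#!/bin/sh
# disk_listed DISK: read the disk list printed by format on stdin and
# succeed (exit status 0) if DISK appears as an entry. Hypothetical helper.
disk_listed() {
    awk -v disk="$1" '
        # Entries look like "1. c0t1d0 <SUN72G ...>", so field 2 is the name.
        $1 ~ /^[0-9]+\.$/ && $2 == disk { found = 1 }
        END { exit(found ? 0 : 1) }'
}
```

A possible use is `echo | format | disk_listed c0t1d0 && echo "c0t1d0 is visible"`, on the assumption that format prints its disk list and then exits when it gets an empty line instead of a disk number.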

10.