Replacing a failed disk in a ZFS/zpool RAID array

From Peter Pap's Technowiki

This procedure was used on a SPARC S7-2, but should apply to most modern SPARC-based servers:

1. Identify the failed disk in the zpool array:

 # zpool status
   pool: rpool
  state: DEGRADED
  
 config:
 
         NAME                       STATE      READ WRITE CKSUM
         rpool                      DEGRADED      0     0     0
           mirror-0                 DEGRADED      0     0     0
             c0t5000CCA02F613ACCd0  ONLINE        0     0     0
             c0t5000CCA02D0F6C44d0  UNAVAIL       0     0     0
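
On a pool with many devices it can help to filter the status output down to the unhealthy disks. The following is a minimal sketch, not part of ZFS: failed_vdevs is a hypothetical helper name, and it assumes device names of the cXtWWNdN form shown above. If only a quick health check is needed, zpool status -x limits the report to pools with problems.

```shell
# Hypothetical helper: read `zpool status` text on stdin and print every
# disk device (cXtYdZ style name) whose STATE column is not ONLINE.
failed_vdevs() {
    awk '$1 ~ /^c[0-9]+t[0-9A-Fa-f]+d[0-9]+$/ && $2 != "ONLINE" { print $1 }'
}
```

Typical use would be: zpool status rpool | failed_vdevs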
 

2. Find the device path for the drive you want to remove:

 # diskinfo D:devchassis-path
 D:devchassis-path                   c:occupant-compdev
 ----------------------------------  ---------------------
 /dev/chassis/SYS/HDD0/disk          c0t5000CCA02F613ACCd0
 /dev/chassis/SYS/HDD1/disk          c0t5000CCA02D0F6C44d0
 /dev/chassis/SYS/HDD2               -
 /dev/chassis/SYS/HDD3               -
 /dev/chassis/SYS/HDD4               -
 /dev/chassis/SYS/HDD5               -
 /dev/chassis/SYS/HDD6               -
 /dev/chassis/SYS/HDD7               -
 /dev/chassis/SYS/MB/EUSB_DISK/disk  c1t0d0
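
The chassis path of the failed disk is needed again in step 5, so it can be convenient to pull it out of the listing by device name. A small sketch, assuming the two-column layout shown above; chassis_path is a hypothetical helper name, not a Solaris command.

```shell
# Hypothetical helper: read the two-column diskinfo listing on stdin and
# print the chassis path whose occupant device equals the name given in $1.
chassis_path() {
    awk -v dev="$1" '$2 == dev { print $1 }'
}
```

Piping the listing from this step into chassis_path c0t5000CCA02D0F6C44d0 should print the HDD1 path used with fmadm later on.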

3. Check the drive's status:

 # cfgadm -al
 Ap_Id                          Type         Receptacle   Occupant     Condition
 /SYS/DBP/NVME0                 unknown      empty        unconfigured unknown
 /SYS/DBP/NVME1                 unknown      empty        unconfigured unknown
 /SYS/DBP/NVME2                 unknown      empty        unconfigured unknown
 /SYS/DBP/NVME3                 unknown      empty        unconfigured unknown
 c3                             scsi-sas     connected    configured   unknown
 c3::w5000cca02f613acd,0        disk-path    connected    configured   unknown
 c4                             scsi-sas     connected    configured   unknown
 c4::w5000CCA02D0F6C45,0        disk-path    connected    configured   unknown
 usb0/1                         usb-storage  connected    configured   ok
 usb0/2                         usb-hub      connected    configured   ok

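Note the slight mismatch between the output of the last two commands: the WWN in the cfgadm Ap_Id (5000CCA02D0F6C45) differs in its last digit from the WWN in the device name (c0t5000CCA02D0F6C44d0), and may differ in case. Match the two on the leading digits.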
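
Because the two names do not match exactly, it is easy to pick the wrong Ap_Id. The sketch below compares a device name with an Ap_Id while ignoring case and the final hex digit. That rule is only an assumption drawn from the output on this page (44 vs 45, acc vs acd), not a documented guarantee, and dev_wwn, apid_wwn and same_disk are hypothetical helper names.

```shell
# Extract the 16-digit WWN from a device name such as c0t5000CCA02D0F6C44d0
# and print it in lower case (prints nothing if the name does not match).
dev_wwn() {
    printf '%s\n' "$1" |
        sed -n 's/^c[0-9]*t\([0-9A-Fa-f]\{16\}\)d[0-9]*$/\1/p' | tr 'A-F' 'a-f'
}

# Extract the 16-digit WWN from a cfgadm Ap_Id such as c4::w5000CCA02D0F6C45,0
# and print it in lower case.
apid_wwn() {
    printf '%s\n' "$1" |
        sed -n 's/^c[0-9]*::w\([0-9A-Fa-f]\{16\}\),[0-9]*$/\1/p' | tr 'A-F' 'a-f'
}

# Succeed if device name $1 and Ap_Id $2 share the same WWN apart from the
# last hex digit (assumption based on the output shown on this page).
same_disk() {
    d=$(dev_wwn "$1")
    a=$(apid_wwn "$2")
    [ -n "$d" ] && [ -n "$a" ] && [ "${d%?}" = "${a%?}" ]
}
```

For example, same_disk c0t5000CCA02D0F6C44d0 'c4::w5000CCA02D0F6C45,0' succeeds, while the same device name paired with the c3 Ap_Id fails.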

4. Unconfigure the drive:

 # cfgadm -c unconfigure c4::w5000CCA02D0F6C45,0 

and check that it worked:

 # cfgadm -al
 Ap_Id                          Type         Receptacle   Occupant     Condition
 /SYS/DBP/NVME0                 unknown      empty        unconfigured unknown
 /SYS/DBP/NVME1                 unknown      empty        unconfigured unknown
 /SYS/DBP/NVME2                 unknown      empty        unconfigured unknown
 /SYS/DBP/NVME3                 unknown      empty        unconfigured unknown
 c3                             scsi-sas     connected    configured   unknown
 c3::w5000cca02f613acd,0        disk-path    connected    configured   unknown
 c4                             scsi-sas     connected    configured   unknown
 c4::w5000CCA02D0F6C45,0        disk-path    connected    unconfigured unknown

5. Turn on the OK to Remove indicator for that drive:

 # fmadm set-indicator /dev/chassis/SYS/HDD1/disk ok2rm on

and check that it worked:

 # fmadm get-indicator /dev/chassis/SYS/HDD1/disk ok2rm
 The indicator (ok2rm) is set to on.

6. Remove the failed drive and replace it with the new one.

7. The new drive should be configured automatically, but check it anyway:

 # cfgadm -al
 Ap_Id                          Type         Receptacle   Occupant     Condition
 /SYS/DBP/NVME0                 unknown      empty        unconfigured unknown
 /SYS/DBP/NVME1                 unknown      empty        unconfigured unknown
 /SYS/DBP/NVME2                 unknown      empty        unconfigured unknown
 /SYS/DBP/NVME3                 unknown      empty        unconfigured unknown
 c3                             scsi-sas     connected    configured   unknown
 c3::w5000cca02f613acd,0        disk-path    connected    configured   unknown
 c4                             scsi-sas     connected    configured   unknown
 c4::w5000cca07d293ca1,0        disk-path    connected    configured   unknown
 usb0/1                         usb-storage  connected    configured   ok
 usb0/2                         usb-hub      connected    configured   ok

If not, configure it with:

 # cfgadm -c configure c4::w5000cca07d293ca1,0

8. Get the name of the new drive from the format command:

 # format
 Searching for disks...done
 
 
 AVAILABLE DISK SELECTIONS:
        0. c0t5000CCA02F613ACCd0 <HGST-H101860SFSUN600G-A990-558.91GB>
           /scsi_vhci/disk@g5000cca02f613acc
           /dev/chassis/SYS/HDD0/disk
        1. c0t5000CCA07D293CA0d0 <HGST-H101860SFSUN600G-A990-558.91GB>
           /scsi_vhci/disk@g5000cca07d293ca0
           /dev/chassis/SYS/HDD1/disk
        2. c1t0d0 <VT-eUSB-7722-1.91GB>
           /pci@300/pci@1/pci@0/pci@2/usb@0/storage@1/disk@0,0
           /dev/chassis/SYS/MB/EUSB_DISK/disk
 Specify disk (enter its number):

9. Replace the failed disk with the new disk in the zpool:

 # zpool replace rpool c0t5000CCA02D0F6C44d0 c0t5000CCA07D293CA0d0

10. Check the progress of the resilver with:

 # zpool status
   pool: rpool
  state: ONLINE
   scan: resilvered 75.8G in 8m24s with 0 errors on Wed Feb 27 11:39:38 2019
 
 config:
 
         NAME                       STATE      READ WRITE CKSUM
         rpool                      ONLINE        0     0     0
           mirror-0                 ONLINE        0     0     0
             c0t5000CCA02F613ACCd0  ONLINE        0     0     0
             c0t5000CCA07D293CA0d0  ONLINE        0     0     0
 
 errors: No known data errors
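
On a larger pool the resilver can take much longer than the eight minutes shown here, so polling may be more practical than re-running the command by hand. A minimal sketch, assuming the scan line reads "resilver in progress" while the resilver is running (check the wording on your release); resilver_in_progress and wait_for_resilver are hypothetical helper names.

```shell
# Succeed if the `zpool status` text on stdin reports a running resilver.
# The matched phrase is an assumption about the status wording.
resilver_in_progress() {
    grep -q 'resilver in progress'
}

# Poll pool $1 every $2 seconds (default 60) until no resilver is reported.
wait_for_resilver() {
    while zpool status "$1" | resilver_in_progress; do
        sleep "${2:-60}"
    done
}
```

Running wait_for_resilver rpool 30 before step 11 would block until the status output no longer reports a resilver in progress.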

11. Once the zpool is happy again, install the boot block on the new disk:

 # installboot -f -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c0t5000CCA07D293CA0d0s0