Upgraded a ZFS-based storage box (ZFS on Linux) from ~18 TB to ~36 TB of nominal usable capacity. I had to re-create the pool for that. Here is how I did it…
First of all I removed the mirrored ZIL pair and the hot spare, just to make sure neither would cause any trouble:
root@storage:~# zpool status
  pool: storage
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Fri Jan 15 14:14:53 2016
config:

        NAME             STATE     READ WRITE CKSUM
        storage          ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            WD-WCC4N2xx  ONLINE       0     0     0
            SG-W6A12Gxx  ONLINE       0     0     0
            WD-WCC4N6xx  ONLINE       0     0     0
          raidz1-1       ONLINE       0     0     0
            SG-W6A12Fxx  ONLINE       0     0     0
            WD-WCC4N6xx  ONLINE       0     0     0
            SG-W6A12Fxx  ONLINE       0     0     0
          raidz1-2       ONLINE       0     0     0
            SG-W6A12Gxx  ONLINE       0     0     0
            WD-WCC4N6xx  ONLINE       0     0     0
            SG-W6A12Fxx  ONLINE       0     0     0
        logs
          mirror-3       ONLINE       0     0     0
            zil1         ONLINE       0     0     0
            zil2         ONLINE       0     0     0
        cache
          cache1         ONLINE       0     0     0
          cache2         ONLINE       0     0     0
        spares
          WD-WCC4N4xx    AVAIL

errors: No known data errors
My first try to remove the ZIL:
root@storage:~# zpool remove storage zil1
cannot remove zil1: no such device in pool
root@storage:~# zpool remove storage zil2
cannot remove zil2: no such device in pool
Ah. I have to use the full path.
root@storage:~# zpool remove storage /dev/disk/by-partlabel/zil1
cannot remove /dev/disk/by-partlabel/zil1: operation not supported on this type of pool
root@storage:~# zpool remove storage /dev/disk/by-partlabel/zil2
cannot remove /dev/disk/by-partlabel/zil2: operation not supported on this type of pool
I had forgotten something specific about the ZIL device: if you have a mirrored pair for the ZIL, you need to remove the mirror (the vdev) instead of the physically attached discs:
root@storage:~# zpool remove storage mirror-3
Now let me get rid of the hot spare:
root@storage:~# zpool remove storage /dev/disk/by-partlabel/WD-WCC4N4xx
Right after a zpool scrub storage, the pool looks like this:
root@storage:~# zpool status
  pool: storage
 state: ONLINE
  scan: scrub repaired 0 in 1h1m with 0 errors on Wed Jan 27 18:38:57 2016
config:

        NAME             STATE     READ WRITE CKSUM
        storage          ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            WD-WCC4N2xx  ONLINE       0     0     0
            SG-W6A12Gxx  ONLINE       0     0     0
            WD-WCC4N6xx  ONLINE       0     0     0
          raidz1-1       ONLINE       0     0     0
            SG-W6A12Fxx  ONLINE       0     0     0
            WD-WCC4N6xx  ONLINE       0     0     0
            SG-W6A12Fxx  ONLINE       0     0     0
          raidz1-2       ONLINE       0     0     0
            SG-W6A12Gxx  ONLINE       0     0     0
            WD-WCC4N6xx  ONLINE       0     0     0
            SG-W6A12Fxx  ONLINE       0     0     0
        cache
          cache1         ONLINE       0     0     0
          cache2         ONLINE       0     0     0

errors: No known data errors
The current pool consists of three raidz1 vdevs with three 3 TB discs each. You might assume that this would lead to 18 TB of usable space. However, after alignment, internal reservation and the TB/TiB difference, only about 14.7 TB of usable space is left. Anyway: the new pool should consist of four raidz1 vdevs with four 3 TB discs each. Instead of two vdevs of eight discs each I went for the four-vdev variant because of the higher IOPS.
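As a quick back-of-the-envelope check for the new layout (my own arithmetic, nothing ZFS reports): each raidz1 vdev loses one of its four 3 TB discs to parity, so

echo $(( 4 * (4 - 1) * 3 ))   # vdevs * data discs per vdev * TB per disc = 36 TB nominal

and the same alignment, reservation and TB-to-TiB effects as before will shave a good chunk off that before it shows up in zfs list.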
First of all, I am picking one of the new discs to store a replicated copy of the pool on. Just to be safe, I am also saving a .tar.gz of all the data on that disc. So: cfdisk /dev/sdk, followed by partprobe, followed by mkfs.ext4 /dev/sdk1, and finally mount /dev/sdk1 /mnt.
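For reference, the same preparation can be done non-interactively; a minimal sketch, assuming the backup disc really is /dev/sdk, that sgdisk (from gdisk) is installed, and with "backup" as an example partition name:

sgdisk --zap-all /dev/sdk                          # wipe any old partition table
sgdisk -n 1:0:0 -t 1:8300 -c 1:backup /dev/sdk     # one Linux partition spanning the disc, named "backup"
partprobe
mkfs.ext4 /dev/sdk1
mount /dev/sdk1 /mnt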
Now let’s create a replication stream and store it compressed:
root@storage:~# zfs snapshot -r storage@backup
root@storage:~# zfs send -R storage@backup | pigz > /mnt/backup.gz
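An lz4 variant of the same pipeline would look roughly like this (just a sketch, I stuck with pigz; it assumes the lz4 command-line tool is installed):

zfs send -R storage@backup | lz4 > /mnt/backup.zfs.lz4   # faster, lighter compression
# and later, to restore:
lz4 -d < /mnt/backup.zfs.lz4 | zfs recv -F storage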
Either way, this will take a while. If you have enough space and your disc is fast enough, you can save some time by skipping compression or by using the faster lz4, as sketched above. If you have spare backups somewhere else, you most likely do not need this step at all. Once finished, I destroy the original pool so that I can create a new one:
root@storage:~# zpool destroy storage
As explained in my previous posts, I am preparing the new discs (except for sdk, which holds my backup) by writing a GPT partition table, creating one partition and setting the disc’s serial number as the partition name. The type is 39 (Solaris root). That makes it easy to identify a failing disc.
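Scripted, that preparation could look roughly like this; the device list and the lsblk-based serial lookup are assumptions for illustration (bf00 is sgdisk’s type code for Solaris root), I actually do it by hand per disc:

for disk in /dev/sd{a..j}; do
    serial=$(lsblk -dno SERIAL "$disk")                  # read the disc serial number
    sgdisk --zap-all "$disk"                             # fresh GPT
    sgdisk -n 1:0:0 -t 1:bf00 -c 1:"$serial" "$disk"     # one partition, Solaris root, serial as partition name
done
partprobe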
First I need to create a fake disk, because zfs does not know a keyword like "missing":
root@storage:~# dd if=/dev/zero of=/tmp/fakedisk bs=1 count=0 seek=3T
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.00013862 s, 0.0 kB/s
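As a side note, the same sparse file could be created with truncate from coreutils; not what I ran, just an equivalent:

truncate -s 3T /tmp/fakedisk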
Then let’s create the Pool:
zpool create -o ashift=12 storage \
  raidz1 /dev/disk/by-partlabel/WD-WCC4N2xx /dev/disk/by-partlabel/SG-W6A12Gxx \
         /dev/disk/by-partlabel/WD-WCC4N6xx /dev/disk/by-partlabel/SG-Z5020Fxx \
  raidz1 /dev/disk/by-partlabel/WD-WCC4N6xx /dev/disk/by-partlabel/SG-W6A12Fxx \
         /dev/disk/by-partlabel/WD-WCC4N4xx /dev/disk/by-partlabel/SG-W6A12Fxx \
  raidz1 /dev/disk/by-partlabel/SG-W6A12Gxx /dev/disk/by-partlabel/WD-WCC4N6xx \
         /dev/disk/by-partlabel/SG-W6A12Fxx /tmp/fakedisk \
  raidz1 /dev/disk/by-partlabel/WD-WCAWZ2xx /dev/disk/by-partlabel/SG-Z501ZYxx \
         /dev/disk/by-partlabel/WD-WCAWZ2xx /dev/disk/by-partlabel/SG-Z5020Gxx \
  log mirror /dev/disk/by-partlabel/zil1 /dev/disk/by-partlabel/zil2 \
  cache /dev/disk/by-partlabel/cache1 /dev/disk/by-partlabel/cache2
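To double-check that the pool really got the 4K alignment, zdb can print the ashift that was recorded; the grep is just my quick filter:

zdb -C storage | grep ashift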
Now let’s kill the fake disk:
zpool offline storage /tmp/fakedisk
rm -rf /tmp/fakedisk
zpool scrub storage
Now let’s copy the data back:
root@storage:/mnt# unpigz backup.gz | zfs recv -F storage@backup
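If you want to watch the restore progress, piping through pv works as well. A sketch I did not run (it assumes pv is installed; note the -c so that unpigz writes to stdout, and that I would target the pool root filesystem when receiving a -R stream):

unpigz -c backup.gz | pv | zfs recv -F storage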
This will take a while again… Okay, it took too long (or I did something wrong), so I hit Ctrl+C, re-created the datasets and unpacked the .tar.gz I had created earlier. 🙂 I am so impatient sometimes; bad habit. Anyway, it is time to prepare the last drive and replace the fake device with the remaining disc. Currently the pool looks like this:
root@storage:~# zpool status
  pool: storage
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 0h0m with 0 errors on Thu Jan 28 11:52:34 2016
config:

        NAME               STATE     READ WRITE CKSUM
        storage            DEGRADED     0     0     0
          raidz1-0         ONLINE       0     0     0
            WD-WCC4N2xx    ONLINE       0     0     0
            SG-W6A12Gxx    ONLINE       0     0     0
            WD-WCC4N6xx    ONLINE       0     0     0
            SG-Z5020Fxx    ONLINE       0     0     0
          raidz1-1         ONLINE       0     0     0
            WD-WCC4N6xx    ONLINE       0     0     0
            SG-W6A12Fxx    ONLINE       0     0     0
            WD-WCC4N4xx    ONLINE       0     0     0
            SG-W6A12Fxx    ONLINE       0     0     0
          raidz1-2         DEGRADED     0     0     0
            SG-W6A12Gxx    ONLINE       0     0     0
            WD-WCC4N6xx    ONLINE       0     0     0
            SG-W6A12Fxx    ONLINE       0     0     0
            /tmp/fakedisk  OFFLINE      0     0     0
          raidz1-3         ONLINE       0     0     0
            WD-WCAWZ2xx    ONLINE       0     0     0
            SG-Z501ZYxx    ONLINE       0     0     0
            WD-WCAWZ2xx    ONLINE       0     0     0
            SG-Z5020Gxx    ONLINE       0     0     0
        logs
          mirror-4         ONLINE       0     0     0
            zil1           ONLINE       0     0     0
            zil2           ONLINE       0     0     0
        cache
          cache1           ONLINE       0     0     0
          cache2           ONLINE       0     0     0

errors: No known data errors

root@storage:~# zpool iostat -v
                       capacity     operations    bandwidth
pool                alloc   free   read  write   read  write
------------------  -----  -----  -----  -----  -----  -----
storage              285G  43.2T      1    257  2.62K  24.4M
  raidz1            71.3G  10.8T      0     64    705  6.11M
    WD-WCC4N2xx         -      -      0     30    627  2.13M
    SG-W6A12Gxx         -      -      0     27    584  2.13M
    WD-WCC4N6xx         -      -      0     30    569  2.13M
    SG-Z5020Fxx         -      -      0     27    593  2.13M
  raidz1            71.3G  10.8T      0     64    668  6.11M
    WD-WCC4N6xx         -      -      0     30    602  2.13M
    SG-W6A12Fxx         -      -      0     27    539  2.13M
    WD-WCC4N4xx         -      -      0     30    596  2.13M
    SG-W6A12Fxx         -      -      0     27    590  2.13M
  raidz1            71.3G  10.8T      0     64    671  6.11M
    SG-W6A12Gxx         -      -      0     29    583  2.13M
    WD-WCC4N6xx         -      -      0     28    614  2.13M
    SG-W6A12Fxx         -      -      0     29   1003  2.13M
    /tmp/fakedisk       -      -      0      0      0      0
  raidz1            71.3G  10.8T      0     64    641  6.11M
    WD-WCAWZ2xx         -      -      0     29    546  2.13M
    SG-Z501ZYxx         -      -      0     27    585  2.13M
    WD-WCAWZ2xx         -      -      0     29    554  2.13M
    SG-Z5020Gxx         -      -      0     27    562  2.13M
logs                    -      -      -      -      -      -
  mirror                 0  15.2G      0      0      0      0
    zil1                -      -      0      0    168    114
    zil2                -      -      0      0    168    114
cache                   -      -      -      -      -      -
  cache1            68.5G  95.4G      0     64    341  8.05M
  cache2            68.5G  95.4G      0     64    332  8.04M
------------------  -----  -----  -----  -----  -----  -----
First try to replace the disc (I changed the partition type to Solaris root, added the partition name and issued partprobe):
root@storage:~# zpool replace storage /tmp/fakedisk /dev/disk/by-partlabel/SG-Z5020Gxx
invalid vdev specification
use '-f' to override the following errors:
/dev/disk/by-partlabel/SG-Z5020Gxx contains a filesystem of type 'ext4'
pfft…
root@storage:~# zpool replace storage -f /tmp/fakedisk /dev/disk/by-partlabel/SG-Z5020Gxx
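Alternatively one could wipe the stale ext4 signature first and then replace without -f; a variant I did not use (wipefs ships with util-linux):

wipefs -a /dev/disk/by-partlabel/SG-Z5020Gxx
zpool replace storage /tmp/fakedisk /dev/disk/by-partlabel/SG-Z5020Gxx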
Now let’s take a look at the pool:
root@storage:~# zpool status
  pool: storage
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jan 28 19:41:48 2016
        85.0G scanned out of 1.07T at 750M/s, 0h23m to go
        5.06G resilvered, 7.75% done
config:

        NAME                 STATE     READ WRITE CKSUM
        storage              DEGRADED     0     0     0
          raidz1-0           ONLINE       0     0     0
            WD-WCC4N2xx      ONLINE       0     0     0
            SG-W6A12Gxx      ONLINE       0     0     0
            WD-WCC4N6xx      ONLINE       0     0     0
            SG-Z5020Fxx      ONLINE       0     0     0
          raidz1-1           ONLINE       0     0     0
            WD-WCC4N6xx      ONLINE       0     0     0
            SG-W6A12Fxx      ONLINE       0     0     0
            WD-WCC4N4xx      ONLINE       0     0     0
            SG-W6A12Fxx      ONLINE       0     0     0
          raidz1-2           DEGRADED     0     0     0
            SG-W6A12Gxx      ONLINE       0     0     0
            WD-WCC4N6xx      ONLINE       0     0     0
            SG-W6A12Fxx      ONLINE       0     0     0
            replacing-3      OFFLINE      0     0     0
              /tmp/fakedisk  OFFLINE      0     0     0
              SG-Z5020Gxx    ONLINE       0     0     0  (resilvering)
          raidz1-3           ONLINE       0     0     0
            WD-WCAWZ2xx      ONLINE       0     0     0
            SG-Z501ZYxx      ONLINE       0     0     0
            WD-WCAWZ2xx      ONLINE       0     0     0
            SG-Z5020Gxx      ONLINE       0     0     0
        logs
          mirror-4           ONLINE       0     0     0
            zil1             ONLINE       0     0     0
            zil2             ONLINE       0     0     0
        cache
          cache1             ONLINE       0     0     0
          cache2             ONLINE       0     0     0

errors: No known data errors
Oh my! Resilvering at ~750 MB/s? Nice. However, I cannot explain this bandwidth: zpool iostat 1 jumps between values like 26 and 83 MB/s, and iostat -dxh shows more like 95 MB/s if I sum up reads and writes in KB. Since a single disc can only manage something like 95–150 MB/s of writes, those 750 MB/s are probably some summed-up value. Twenty minutes later the pool was back. Just to make sure, a scrub:
root@storage:~# zpool status
  pool: storage
 state: ONLINE
  scan: scrub in progress since Thu Jan 28 21:12:03 2016
        115G scanned out of 1.07T at 1.47G/s, 0h11m to go
        0 repaired, 10.49% done
config:

        NAME             STATE     READ WRITE CKSUM
        storage          ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            WD-WCC4N2xx  ONLINE       0     0     0
            SG-W6A12Gxx  ONLINE       0     0     0
            WD-WCC4N6xx  ONLINE       0     0     0
            SG-Z5020Fxx  ONLINE       0     0     0
          raidz1-1       ONLINE       0     0     0
            WD-WCC4N6xx  ONLINE       0     0     0
            SG-W6A12Fxx  ONLINE       0     0     0
            WD-WCC4N4xx  ONLINE       0     0     0
            SG-W6A12Fxx  ONLINE       0     0     0
          raidz1-2       ONLINE       0     0     0
            SG-W6A12Gxx  ONLINE       0     0     0
            WD-WCC4N6xx  ONLINE       0     0     0
            SG-W6A12Fxx  ONLINE       0     0     0
            SG-Z5020Gxx  ONLINE       0     0     0
          raidz1-3       ONLINE       0     0     0
            WD-WCAWZ2xx  ONLINE       0     0     0
            SG-Z501ZYxx  ONLINE       0     0     0
            WD-WCAWZ2xx  ONLINE       0     0     0
            SG-Z5020Gxx  ONLINE       0     0     0
        logs
          mirror-4       ONLINE       0     0     0
            zil1         ONLINE       0     0     0
            zil2         ONLINE       0     0     0
        cache
          cache1         ONLINE       0     0     0
          cache2         ONLINE       0     0     0

errors: No known data errors