Well, playing around with zfs send and zfs recv for the first time turned out to be a little nightmare. Most of that nightmare has little to do with ZFS itself — probably 50% PEBKAC and 50% buggy implementation in ZFS on Linux. The funny part is that I first used ZFS back in 2008 and had never played with zfs send and recv before.
I noticed that a deduplicated full replication stream was, luckily, about half the size: 2.8 GB for the uncompressed stream, 1.5 GB compressed, and 740 MB deduplicated and compressed (using lz4 -3). I also noticed that sending a deduplicated incremental stream leads to an invalid backup stream and a receive process that cannot be terminated with Ctrl+C. As if that weren't enough, I was also unable to kill the process (neither -15 nor -9 works); only a reboot makes it disappear. A few hours later I found zfs receive fails receiving a deduplicated stream "invalid backup stream" #2210, which references send -D -R (-I) regressions #3066. Okay... using deduplication only for the initial replication, then.
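For the record, the "dedup only for the initial replication" approach can be sketched roughly like this. Pool, dataset, snapshot, and host names here are made up for illustration:

```shell
# One-time full replication: deduplicated (-D), recursive (-R),
# compressed with lz4 on the wire. Names are hypothetical.
zfs snapshot -r storage@initial
zfs send -D -R storage@initial \
  | lz4 -3 \
  | ssh backuphost "lz4 -d | zfs recv -u storage/backups/storage"

# Later incremental sends go WITHOUT -D, to avoid the
# invalid-stream/unkillable-process bug from #2210 / #3066:
zfs snapshot -r storage@next
zfs send -R -i storage@initial storage@next \
  | lz4 -3 \
  | ssh backuphost "lz4 -d | zfs recv -u storage/backups/storage"
```

`zfs recv -u` keeps the received datasets unmounted on the destination, which is usually what you want for a backup target.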
Next I was wondering about the transfer speed. Even though there was a 1 Gbit/s connection, nload showed an average transfer rate of just 1-2 MB/s; sometimes it would go much higher, sometimes worse. Obviously, such a transfer takes... time. Today I stumbled upon Implement a receive / send buffer on zfs send | zfs receive #4281. Imagine that. I haven't tried it yet, though it pretty much explains the behavior I was seeing on different links. While trying to make the transfer faster and faster I played around with various compression methods (you might have noticed my post about selecting a compression method for zfs send), which let me reduce the transfer time. Still, I am getting an average of just 33 Mbit/s on a 100 Mbit/s link, even though nload shows 98 Mbit/s most of the time.
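The usual workaround for the missing send/receive buffer (which I haven't tried yet) is to put a tool like mbuffer on both ends, so the bursty output of zfs send doesn't repeatedly stall the link. A sketch, with made-up host, port, and dataset names:

```shell
# Receiver side: listen on a TCP port with a 1 GB buffer,
# feed the stream into zfs recv. Names are hypothetical.
mbuffer -I 9090 -s 128k -m 1G | zfs recv -u storage/backups/storage

# Sender side: push the incremental stream through a local
# 1 GB buffer straight to the receiver.
zfs send -i storage@old storage@new | mbuffer -s 128k -m 1G -O backuphost:9090
```

The buffers absorb the pauses on both ends, so the network link can stay saturated even while zfs send or zfs recv is briefly busy elsewhere.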
First I tried to use -R and -i just as explained in 15.2.1 Using ZFS for File System Replication. And guess what, I ran into the next problem. My auto-snapshot script creates snapshots if the property com.sun:auto-snapshot is set to true. This was the case on the source and enabled for the datasets on the destination, EXCEPT for the backups dataset, where I had explicitly disabled it. Setting a dataset read-only via zfs recv -o readonly=on does not prevent snapshot creation (that is totally okay, you just have to know it). However, the replication sets com.sun:auto-snapshot to true on the transferred dataset, so my snapshot script was creating snapshots, and hence every further incremental send would fail. Now, the documentation states that, apart from doing -o readonly=on, you can unset a specific property using -x. But this functionality does not seem to work in ZFS on Linux (0.6.5.4), and I created an issue report on GitHub (closed already; I just hadn't seen that there were already issues for exactly this) in which a few related issues are referenced, e.g. zfs receive does neither support -o property=value nor -x property #1350. So I just added an exclude option to my snapshot script which excludes a given path, in this case storage/backups.
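An alternative to patching the snapshot script would have been to override the property locally on the destination after the first receive — locally set properties take precedence over received ones in ZFS property inheritance. A sketch with made-up dataset names:

```shell
# Since 'zfs recv -x com.sun:auto-snapshot' didn't work on 0.6.5.4,
# override the received value with a local one on the destination:
zfs set com.sun:auto-snapshot=false storage/backups/storage

# Verify where each value comes from (local vs. received vs. inherited):
zfs get -r -o name,value,source com.sun:auto-snapshot storage/backups
```

Because the local setting wins over the received one, later incremental receives should not flip the property back to true.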
It was about time to try a somewhat bigger pool — just 300 GB of data. After some hours I wanted to press CTRL+Z, then follow it with disown, my usual approach to send something to the background and close the SSH session. Pretty sure you already know that I'll show you another GitHub issue. Here you go: 'zfs send' aborts on any signal #810.
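Given that zfs send dies on any signal, the CTRL+Z/disown trick is off the table; the only safe option is to detach the transfer from the terminal before it starts. A sketch, again with hypothetical names:

```shell
# Detach from the start so no SIGTSTP/SIGHUP ever reaches zfs send.
# Dataset and host names are made up.
nohup sh -c 'zfs send storage@snap | ssh backuphost "zfs recv -u storage/backups/storage"' \
  > /var/log/zfs-send.log 2>&1 &

# Or run the whole pipeline inside a detached screen session:
screen -dmS zfsrepl sh -c \
  'zfs send storage@snap | ssh backuphost "zfs recv -u storage/backups/storage"'
```

With screen (or tmux) you can reattach later to check progress, which nohup alone doesn't give you.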
Seriously, I think I've run into every possible bug. I also had trouble using zfs recv -F (which can be dangerous under some conditions). I think a combination of zfs rollback and zfs recv -F caused the original dataset to get renamed while trying to synchronize:
storage/backups/storage                              1.07G  14.9T   140K  /storage/backups/storage
storage/backups/storage/images                       1.07G  98.9G  1.07G  /storage/backups/storage/images
storage/backups/storagerecv-20913-1                  55.8G  14.9T   140K  /storage/backups/storagerecv-20913-1
storage/backups/storagerecv-20913-1/containers       54.6G  14.9T   151K  /storage/backups/storagerecv-20913-1/containers
storage/backups/storagerecv-20913-1/containers/100   10.4G  1014G  4.57G  /storage/backups/storagerecv-20913-1/containers/100
It somehow renamed storage to storagerecv-20913-1 and synchronized storage/images again. I also had trouble with a broken stream (it turned out to be the -BD option to the lz4 binary). This issue, however, initially led me to the following GitHub issue: zfs send fails with "operation not applicable to datasets of this type" #2268.
Instead of using -R for zfs send and -F for zfs recv, I am working on a per-dataset basis now, which avoids a lot of trouble... Guess what: the replication is finally working fine! Just waiting for the next bug to arise! :^)
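Conceptually, the per-dataset approach boils down to a loop like the following — a minimal sketch, not my actual script; pool, host, and snapshot names are hypothetical:

```shell
#!/bin/sh
# Per-dataset incremental replication instead of send -R / recv -F.
# All names below are placeholders.
SRC=storage
DST=storage/backups/storage
HOST=backuphost
OLD=prev   # snapshot already present on both sides
NEW=now    # freshly created snapshot

for ds in $(zfs list -H -o name -r -t filesystem "$SRC"); do
  # Strip the source prefix to build the matching destination path,
  # then send this dataset's increment on its own.
  zfs send -i "$ds@$OLD" "$ds@$NEW" \
    | ssh "$HOST" "zfs recv -u $DST${ds#$SRC}"
done
```

Sending each dataset individually means a problem with one dataset (a stray snapshot, a diverged destination) only breaks that one transfer, instead of aborting the whole recursive stream.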