The company I work for asked me to write a document about how to use DRBD and OCFS2 for MySQL clustering as a replacement for MySQL NDB. The document is nearly finished. Today, however, I talked with a friend who works a lot with Oracle databases and asked him how he would test the locking characteristics of the filesystem. He told me this can be done with dd in four tests:
- "dd if=/dev/zero of=/mnt/filename bs=100M count=100" on Node1, on the cluster filesystem.
- The same on the other node (Node2). The times should be equal.
- The same twice (2x) in parallel on Node1. The time will get noticeably worse.
- Now on both nodes at the same time. The time should be higher; the higher, the worse.
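The single-node part of this procedure (tests 1 to 3) could be scripted roughly like this. It is only a sketch: the function name `run_writers` and the file names `ddtest.N` are made up for illustration, and each writer gets its own file here, whereas the runs below all write to the same path.

```shell
#!/bin/sh
# Hypothetical helper: start WRITERS concurrent dd processes in DIR
# and print how many seconds it took until all of them finished.
# Assumption: one output file per writer (ddtest.1, ddtest.2, ...).
run_writers() {
    dir=$1 writers=$2 bs=$3 count=$4
    start=$(date +%s)
    i=1
    while [ "$i" -le "$writers" ]; do
        dd if=/dev/zero of="$dir/ddtest.$i" bs="$bs" count="$count" 2>/dev/null &
        i=$((i + 1))
    done
    wait    # block until every background dd has finished
    end=$(date +%s)
    echo "$writers writer(s): $((end - start)) s"
}

# Example calls (not executed here; sizes as used later in this post):
#   run_writers /var/lib/mysql 1 100M 20    # test 1 / test 2
#   run_writers /var/lib/mysql 2 100M 20    # test 3
```

Comparing the elapsed time for one writer against two writers on the same node gives the baseline that test 4 is later measured against.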
So I ran these checks on the system I set up with DRBD and OCFS2. Here are the results:
Check 1
ER-40021:/var/lib/mysql# time dd if=/dev/zero of=/var/lib/mysql/test bs=100M count=10
1048576000 bytes (1.0 GB) copied, 88.1202 s, 11.9 MB/s
real 1m29.530s
user 0m0.000s
sys 0m3.496s
Check 2
ER-20026:/var/lib/mysql# time dd if=/dev/zero of=/var/lib/mysql/test bs=100M count=10
1048576000 bytes (1.0 GB) copied, 84.3576 s, 12.4 MB/s
real 1m24.622s
user 0m0.000s
sys 0m5.408s
Check 3
ER-40021:/var/lib/mysql# time dd if=/dev/zero of=/var/lib/mysql/test bs=100M count=10 &
[1] 5252
ER-40021:/var/lib/mysql# time dd if=/dev/zero of=/var/lib/mysql/test bs=100M count=10
1048576000 bytes (1.0 GB) copied, 84.1481 s, 12.5 MB/s
real 1m24.314s
user 0m0.000s
sys 0m2.964s
1048576000 bytes (1.0 GB) copied, 137.668 s, 7.6 MB/s
[1]+ Done time dd if=/dev/zero of=/var/lib/mysql/test bs=100M count=10
real 2m19.164s
user 0m0.000s
sys 0m5.364s
Check 4
Node 1:
ER-40021:/var/lib/mysql# time dd if=/dev/zero of=/var/lib/mysql/test bs=100M count=10
1048576000 bytes (1.0 GB) copied, 156.417 s, 6.7 MB/s
real 2m36.583s
user 0m0.004s
sys 0m2.512s
Node 2:
ER-20026:/var/lib/mysql# time dd if=/dev/zero of=/var/lib/mysql/test bs=100M count=10
1048576000 bytes (1.0 GB) copied, 173.982 s, 6.0 MB/s
real 3m24.200s
user 0m0.000s
sys 0m3.372s
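For check 4 both nodes have to start writing at (nearly) the same moment. One way to do that from a single terminal is to launch the command over ssh on both nodes in the background. This is only a sketch: the function name `run_on_nodes` and the host names `node1` and `node2` are placeholders, and it assumes key-based ssh login to both boxes.

```shell
#!/bin/sh
# Hypothetical helper: run CMD on every listed node in parallel via ssh
# and return once all of them have finished.
run_on_nodes() {
    cmd=$1
    shift
    for node in "$@"; do
        ssh "$node" "$cmd" &    # one background ssh session per node
    done
    wait
}

# Example call (not executed here; node1/node2 are placeholder host names):
#   run_on_nodes 'dd if=/dev/zero of=/var/lib/mysql/test bs=100M count=10' node1 node2
```

dd prints its own elapsed time and throughput on stderr, so the two result lines show up in the calling terminal.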
My friend was interested in the %CPU and %WAIT data of the nodes during tests 3 and 4, so I will redo the tests now. As you can see above, I used 1 GB instead of the suggested 10 GB. The next run uses 2 GB.
The Box(es)
- 2 GB RAM
- Intel(R) Core(TM)2 Duo CPU E7200 @ 2.53GHz
- 160 GB SATA (Seagate Barracuda 7200.10 family, ST3160215AS)
- Debian Lenny
top - 22:54:22 up 1 day, 7:04, 1 user, load average: 0.08, 0.15, 0.11
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
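The %wa column that top shows is derived from the CPU counters in /proc/stat. Instead of reading it off the screen, it could be computed from two snapshots of the aggregate "cpu" line. This is a sketch with a made-up function name (`iowait_pct`); it assumes the usual Linux field order user, nice, system, idle, iowait, irq, softirq, steal.

```shell
#!/bin/sh
# Hypothetical helper: given two "cpu ..." lines from /proc/stat
# (earlier snapshot first), print the I/O wait share of the interval in percent.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= NF; i++) before[i] = $i }
        NR == 2 {
            total = 0
            # Fields 2..9: user nice system idle iowait irq softirq steal
            for (i = 2; i <= NF && i <= 9; i++) total += $i - before[i]
            if (total > 0) printf "%.1f\n", 100 * ($6 - before[6]) / total
        }'
}

# Example on a Linux box (not executed here):
#   a=$(grep '^cpu ' /proc/stat); sleep 30; b=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$a" "$b"
```

Unlike the per-CPU lines in the top output, this averages over all cores, so a single core stuck at 90% wait shows up as roughly half that on a dual-core box.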
Check 1
dd:
ER-40021:/var/lib/mysql# time dd if=/dev/zero of=/var/lib/mysql/test bs=100M count=20
2097152000 bytes (2.1 GB) copied, 187.962 s, 11.2 MB/s
real 3m8.043s
user 0m0.000s
sys 0m9.937s
Top after 30 seconds:
top - 22:56:31 up 1 day, 7:07, 1 user, load average: 1.86, 0.63, 0.27
Cpu0 : 0.0%us, 1.6%sy, 0.0%ni, 66.3%id, 32.1%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 8.9%sy, 0.0%ni, 18.4%id, 66.6%wa, 1.3%hi, 4.7%si, 0.0%st
Check 2
dd:
ER-40021:/var/lib/mysql# time dd if=/dev/zero of=/var/lib/mysql/test bs=100M count=20
2097152000 bytes (2.1 GB) copied, 178.706 s, 11.7 MB/s
real 2m59.232s
user 0m0.000s
sys 0m5.836s
Top after 30 seconds:
top - 22:38:37 up 1 day, 7:10, 1 user, load average: 1.97, 0.53, 0.31
Cpu0 : 0.0%us, 4.0%sy, 0.0%ni, 30.2%id, 61.1%wa, 1.7%hi, 3.0%si, 0.0%st
Cpu1 : 0.0%us, 9.3%sy, 0.0%ni, 9.6%id, 78.0%wa, 0.7%hi, 2.4%si, 0.0%st
Check 3
dd:
ER-20026:/var/lib/mysql# time dd if=/dev/zero of=/var/lib/mysql/test bs=100M count=20 &
ER-20026:/var/lib/mysql# time dd if=/dev/zero of=/var/lib/mysql/test bs=100M count=20 &
2097152000 bytes (2.1 GB) copied, 205.511 s, 10.2 MB/s
real 3m25.861s
user 0m0.000s
sys 0m6.152s
2097152000 bytes (2.1 GB) copied, 205.622 s, 10.2 MB/s
real 3m25.692s
user 0m0.000s
sys 0m3.500s
[1]- Done time dd if=/dev/zero of=/var/lib/mysql/test bs=100M count=20
[2]+ Done time dd if=/dev/zero of=/var/lib/mysql/test bs=100M count=20
Top after 30 seconds:
top - 22:44:15 up 1 day, 7:16, 1 user, load average: 2.29, 1.41, 0.80
Cpu0 : 0.0%us, 2.7%sy, 0.0%ni, 3.7%id, 89.0%wa, 0.7%hi, 4.0%si, 0.0%st
Cpu1 : 0.0%us, 2.1%sy, 0.0%ni, 92.8%id, 0.4%wa, 0.8%hi, 3.8%si, 0.0%st
Check 4
Node1:
dd:
2097152000 bytes (2.1 GB) copied, 410.541 s, 5.1 MB/s
real 6m50.703s
user 0m0.008s
sys 0m3.448s
Top after 30 seconds:
top - 23:15:07 up 1 day, 7:25, 1 user, load average: 1.19, 0.40, 0.37
Cpu0 : 0.0%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu1 : 0.0%us, 12.9%sy, 0.0%ni, 79.6%id, 4.2%wa, 0.8%hi, 2.5%si, 0.0%st
Node2:
dd:
2097152000 bytes (2.1 GB) copied, 400.627 s, 5.2 MB/s
real 6m40.983s
user 0m0.000s
sys 0m7.560s
Top after 30 seconds:
top - 22:53:29 up 1 day, 7:25, 1 user, load average: 2.03, 1.42, 1.19
Cpu0 : 0.0%us, 2.0%sy, 0.0%ni, 88.7%id, 6.0%wa, 0.3%hi, 3.0%si, 0.0%st
Cpu1 : 0.0%us, 4.4%sy, 0.0%ni, 67.4%id, 25.8%wa, 0.3%hi, 2.0%si, 0.0%st
As you can see, the I/O wait is quite high. This might be due to slow disks or, more likely, the network: both boxes are connected to each other over a 100 Mbit/s NIC, so data transfer between them is limited to 12.5 MB/s in theory (about 10 MB/s in practice). I showed this data to my friend, and his comments were:
What we can see here is a mixture of two concurrency problems. First, there is disk-internal concurrency: the write head has to jump back and forth to write to the different positions, of course. It's the usual IOPS problem of magnetic disks.
The other problem here is concurrency between the nodes, two in this case. Changing blocks within the same file in the same directory and changing blocks that may still be in the file system cache of the opposite node – all that is stuff the cluster file system has to manage. It has to provide integrity, but should avoid locks whenever possible.
Now about the numbers of the 2 GB test series. Tests 1 and 2 are designed to find the absolute range of I/O performance we are talking about. Test 3 tells us how much the disk-internal concurrency affects our result, and Test 4 is what we are looking for.
You can see that Tests 1 and 2 reach about 11 MB/s each. If two writers are started concurrently on one node, it's still 10 MB/s, but with about 90% I/O wait on the CPU. The high wait rate in Test 3 means that the medium is about to reach saturation now; luckily, Jean caught something near the break-even point with his test.
In preparation for Test 4, this tells us that all the speed degradation we see from now on is caused by the cluster FS, not by the disk. Although we have much lower I/O speed in Test 4 (5 MB/s vs. 10 MB/s in the reference), there is much less WAIT% now. WAIT% is "waiting for IO"; the (un)locking mechanisms of OCFS2 over the network are causing a slight SYS% increase instead. One last point of interest is to find the limiting factor now; the values don't reveal it.
My guess is Networking: Bandwidth, or more likely: Latency.
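To put the bandwidth part of that guess in numbers, the measured rates can be set next to the raw capacity of the 100 Mbit/s link. A small sketch with two made-up helper names (`mb_per_s`, `link_mb_per_s`); like dd, it counts 1 MB as 10^6 bytes and it ignores protocol overhead.

```shell
#!/bin/sh
# Hypothetical helpers for the arithmetic only.
# mb_per_s BYTES SECONDS: throughput in MB/s (1 MB = 10^6 bytes, as dd reports it).
mb_per_s() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1000000 }'
}

# link_mb_per_s MBIT: raw link capacity in MB/s (8 bits per byte, no overhead).
link_mb_per_s() {
    awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 8 }'
}

mb_per_s 2097152000 187.962    # check 1, 2 GB run: prints 11.2
mb_per_s 2097152000 410.541    # check 4, Node1:   prints 5.1
link_mb_per_s 100              # 100 Mbit/s link:  prints 12.5
```

The single-writer rate sits just below the link capacity, while each node in check 4 reaches less than half of it.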
Thanks to Martin (usn) for his time :-)

uhm what does zfs have to do with it. its not a cluster filesystem. whether it’s good or bad is irrelevant to a comparison of zfs with ocfs2 or gfs2. they’re different beasts.
also I am not sure if your tests make much sense. all you end up doing is really testing disk io (or, in this case, actually the network stack) and not the filesystem itself. it would be more interesting to do more complex operations, like doing a kernel build in parallel (2 separate trees on each node) or something like that. then you're actually testing a filesystem.
In general this does not have anything to do with ZFS. I just named ZFS because it's possible (with some hacks) to use it over the network for similar setups. And right, it's not a shared FS. I removed my little text about it :)
Also, please don't misunderstand these tests: they are for testing the locking characteristics, nothing else. I don't think a kernel build in parallel would give me the same results, because a kernel build doesn't write to disk that fast or that much. If I had wanted to test the whole filesystem, there would have been more and different checks.