Linux and Software-Raid

Hello, I’ve been using software raid on Linux for some years now. While playing around recently I found out a few interesting things which I don’t want to keep from you. However, some parts of this article can cost you data. So always make a backup, and look everything up in the man pages to fully understand what you’re doing. Everything written here you do at your own risk.

This little article got a bit long, and yeah, it took me 4 days to write it and test everything.

Mixture of SW Raid and a Desktop Environment

With recent kernels (the last version I tested was 2.6.37) you might see weird behaviour of software raid under high I/O. CFQ is currently the default I/O scheduler in Linux. It tries to schedule everything „fair“, so its goal is not the highest bandwidth or the best speed. This is good for a desktop system, and might also be interesting for servers. However, let’s say you have a desktop system (Xorg running, two xterms and xchat open) with a software raid 10 and ext4 on top of it, and you run bonnie++ (a really old tool, maybe out of date nowadays) to benchmark the filesystem, which causes high I/O. During the „reading getc“ phase you’ll notice the desktop environment hanging a bit and the mouse jumping. When directories get deleted near the end of the test (so as soon as writes are happening) the box hangs really badly: moving the mouse doesn’t work for a minute, switching to the console with ctrl+alt+f1 takes 10 minutes, and so on.

I tested this on two different boxes with different hardware, different hard discs and different filesystems (1: 4×3.1 GHz, 4 GB RAM, 8 GB swap, 4×500 GB SATA; 2: 2×2.0 GHz, 3 GB RAM, 4 GB swap, 3×160 GB SATA).

If you google a bit you’ll find that other people noticed the same or similar problems. I can’t explain the cause, because I’m not that experienced with the Linux kernel’s software raid implementation or with the CFQ scheduler. Switching from CFQ to noop or deadline solves the problem, though.

So if you’re going to use software raid on a desktop machine, I suggest changing the default scheduler from CFQ to deadline. If you’re going to use software raid on a server, you should really test whether the box starts to hang. However, I guess on a server, especially with databases, most people would use deadline anyway. By the way, the bonnie++ command I used was:

bonnie++ -d /some/dir/on/teh/raid -n 512
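A minimal sketch of how the scheduler switch suggested above could be checked. It assumes the usual sysfs layout, where /sys/block/<disc>/queue/scheduler lists the available schedulers and marks the active one with square brackets. The helper name active_scheduler and the sample line are made up for illustration.

```shell
#!/bin/sh
# Print the active I/O scheduler from a sysfs-style scheduler line.
# The kernel marks the active one with brackets, e.g. "noop deadline [cfq]".
# active_scheduler is a hypothetical helper, not part of any tool.
active_scheduler() (
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
)

# Demo with a sample line instead of a real device file:
sample=$(mktemp)
echo "noop deadline [cfq]" > "$sample"
active_scheduler "$sample"
rm -f "$sample"

# On a real box (as root, once per member disc of the raid), roughly:
#   active_scheduler /sys/block/sda/queue/scheduler
#   echo deadline > /sys/block/sda/queue/scheduler
```

The setting is per disc and does not survive a reboot. On kernels of that era the elevator=deadline boot parameter was the usual way to make it permanent, but check that against the documentation of your kernel.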

Setting up a Raid with one missing disc

Sometimes you’ll want to set up a software raid with one disc missing. For example, you’ve installed a system on /dev/sda (160 GB) and now you’re buying three more discs to make a software raid 10. Just boot your system and create the raid using

mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd missing

The keyword is „missing“: it sets up a degraded raid array. Now format the new raid, copy your files over from /dev/sda and make sure that everything is on it (backups! Yes! Really! Make a backup, dude!).

If you made your backup and you’re sure that everything is on the raid, the next step is to verify that you made a backup and that the backup is okay (hey.. you never know..). Now you can add the last disc to the raid array by issuing

mdadm --manage /dev/md0 --add /dev/sda

This adds the new disc as a spare first (this is normal!) and resyncs the array. As soon as it’s synced, the spare becomes an active/working device in the raid array.

Did I tell you to make a backup? No? Then do it.

Real example:

(Real example means I’m doing this while writing.) I have a raid 10 over 4 discs, and now I want to play around a bit with raid 5. I have 400 GB used on it, so I don’t want to lose anything. What I’m doing is simple. First I set one disc to failed and remove it from the array:

umount /mnt
mdadm --manage /dev/md0 --fail /dev/sdd3
mdadm: set /dev/sdd3 faulty in /dev/md0
mdadm --manage /dev/md0 --remove /dev/sdd3
mdadm: hot removed /dev/sdd3 from /dev/md0

Now I use cfdisk to remove all partitions from /dev/sdd, as there’s nothing I need on sdd1 and sdd2, and create one big partition which I format with mkfs.ext4. To make sure that the new partition table is detected correctly you might want to run partprobe.

mkfs.ext4 /dev/sdd1

Now mount your raid again

mount /dev/md0 /mnt

Create a new folder:

mkdir /new

Mount the new disc on it and create a backup folder:

mount /dev/sdd1 /new
mkdir /new/backup

Copy all files from the raid array over to the freed disc:

cp -rva /mnt/* /new/backup/

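As soon as this is finished, you can create the raid 5 (unmount /mnt and stop the old array with mdadm --stop /dev/md0 first, if it is still running). Note the „missing“: sdb is my root device and isn’t part of the array, and sdd, the fourth disc, currently holds the backup and gets added back to the raid later: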

mdadm --create /dev/md0 -n 4 -c 128 -l 5 -e 0.90 /dev/sda /dev/sdc missing /dev/sde

Now create a filesystem:

mkfs.ext4 -L raidarray -m 0.01 -O dir_index,extent,large_file,uninit_bg,sparse_super -b 4096 -E stride=32,stripe-width=96  /dev/md0

Mount the array and copy everything back:

mount /dev/md0 /mnt/
cp -rva /new/backup/* /mnt/

Reboot, test everything and make sure everything got copied (the reboot isn’t really necessary, it’s just to make sure..). Everything fine? Then unmount /new and let’s add the last disc back into the array:

mdadm --manage /dev/md0 --add /dev/sdd

You can check the progress like this:

root@localhost ~ # cat /proc/mdstat 
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md0 : active raid5 sdd[4] sde[3] sdc[1] sda[0]
      1465156224 blocks level 5, 128k chunk, algorithm 2 [4/3] [UU_U]
      [>....................]  recovery =  0.4% (2229604/488385408) finish=164.3min speed=49285K/sec
unused devices: <none>


root@localhost ~ # mdadm -D /dev/md0 
        Version : 0.90
  Creation Time : Fri Jan 21 23:15:04 2011
     Raid Level : raid5
     Array Size : 1465156224 (1397.28 GiB 1500.32 GB)
  Used Dev Size : 488385408 (465.76 GiB 500.11 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Jan 22 01:52:06 2011
          State : active, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 128K

 Rebuild Status : 0% complete

           UUID : 9d9c58b9:cc143e44:04894333:532a878b (local to host localhost)
         Events : 0.15230

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       32        1      active sync   /dev/sdc
       4       8       48        2      spare rebuilding   /dev/sdd
       3       8       64        3      active sync   /dev/sde

Don’t worry about the „spare rebuilding“: after resyncing/rebuilding, the disc is automatically added to the array as a regular device. That’s all, you’re done. Just wait for the rebuild to finish.
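The [4/3] [UU_U] part of /proc/mdstat is what tells you the array is still degraded (4 devices expected, 3 in sync). As a rough sketch, assuming the mdstat layout shown above, a small helper could pull that out. md_degraded is a hypothetical name, and the sample text is taken from the output above.

```shell
#!/bin/sh
# Report "degraded" or "ok" for one md device, based on the [total/active]
# counter that follows the device line in mdstat-style text.
# $1 = md name (e.g. md0), $2 = file to read (defaults to /proc/mdstat)
md_degraded() (
    awk -v dev="$1" '
        $1 == dev { found = 1; next }
        found && match($0, /\[[0-9]+\/[0-9]+\]/) {
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            print ((n[2] + 0 < n[1] + 0) ? "degraded" : "ok")
            exit
        }
    ' "${2:-/proc/mdstat}"
)

# Demo with the output from above instead of the live /proc/mdstat:
sample=$(mktemp)
cat > "$sample" <<'EOF'
md0 : active raid5 sdd[4] sde[3] sdc[1] sda[0]
      1465156224 blocks level 5, 128k chunk, algorithm 2 [4/3] [UU_U]
EOF
md_degraded md0 "$sample"
rm -f "$sample"
```

On a live system, md_degraded md0 without a second argument reads /proc/mdstat directly.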

Using the auto-detect feature of the Kernel

I already wrote about this topic elsewhere. In short: you have to create the array with the metadata version 0.90 format. Some people say you shouldn’t use the auto-detect feature anymore. However, without it it’s difficult to run / (root) on the raid unless you use an initial ramdisk, which requires some more work, so I personally prefer the auto-detect feature.

Improving the performance

Oh, this topic is quite interesting. Raid stands for redundant array of independent discs, not for i’m-pro-and-i-want-the-highest-possible-1337-speed. However, there are a few things you can do to improve speed: a) raising the resync speed, b) optimizing the filesystem for the raid, c) raising the read speed by playing around with the cache, d) raising the write speed by playing around with the cache. So I’m going to split this into four parts:

Improving resync-speed

Well, resyncing takes a lot of time. You can „try“ to raise the minimum resync speed, which might lead to a faster resync. However, I noticed on some boxes that doing so makes it worse, so you’ll have to experiment yourself. Example:

part of cat /proc/mdstat (sync_speed_min not modified)

      [>....................]  recovery =  0.4% (2229604/488385408) finish=164.3min speed=49285K/sec

now switching to 65000

echo 65000 > /sys/block/md0/md/sync_speed_min

part of cat /proc/mdstat

      [===>.................]  recovery = 19.8% (97161728/488385408) finish=99.2min speed=65696K/sec

now switching to 80000

echo 80000 > /sys/block/md0/md/sync_speed_min

part of cat /proc/mdstat

      [====>................]  recovery = 20.3% (99441640/488385408) finish=88.9min speed=72893K/sec

It doesn’t go much higher than 72xxxK/sec here, so 65000 or 70000 might be an okay minimum value on my system. And of course, after changing sync_speed_min you have to wait a bit, let’s say 30 seconds, and then check again. If the speed drops a lot, lower the setting and wait again.
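The change-wait-check routine described above is easier when you only look at the speed value. A small sketch, assuming the mdstat line format shown in the examples. resync_speed is a hypothetical helper name.

```shell
#!/bin/sh
# Print the current resync/recovery speed in K/sec from mdstat-style text.
# $1 = file to read (defaults to /proc/mdstat)
resync_speed() (
    sed -n 's/.*speed=\([0-9][0-9]*\)K\/sec.*/\1/p' "${1:-/proc/mdstat}"
)

# Demo with one of the lines from above:
sample=$(mktemp)
echo '      [===>.................]  recovery = 19.8% (97161728/488385408) finish=99.2min speed=65696K/sec' > "$sample"
resync_speed "$sample"
rm -f "$sample"

# The routine from the text, as root on a real array, would then be roughly:
#   echo 65000 > /sys/block/md0/md/sync_speed_min
#   sleep 30
#   resync_speed
```

If the number printed after the wait is lower than before, put the old sync_speed_min value back.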

Optimizing the Filesystem for a Software Raid

I won’t go into much detail here, because there are really good documents on the web explaining it better than I could. With ext3 and ext4 you can set some options which make the filesystem work better (faster and more efficiently) on a raid. The keywords to search for are stride, stripe and chunk. For the software raid itself you can also define a chunk size, so you should look up what’s good there. I got the best results with a chunk size of 128K on the raid. Here’s an example:

For a raid 10 with 4 discs and a chunk size of 128K you’d use a stride of 32 and a stripe width of 64 (stride = chunk size / block size = 128K / 4K = 32, stripe width = stride × 2 data-bearing discs). The block size of the filesystem itself should stay at its default value (which should be 4K).

For a raid 5 with the same number of discs and the same chunk size you’d use a stride of 32 and a stripe width of 96 (3 data-bearing discs, as one disc’s worth of space goes to parity).
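The two examples above boil down to simple arithmetic. Here is a small sketch of it. ext4_raid_opts is a made-up helper name, and you have to work out the number of data-bearing discs yourself (half the discs for a raid 10 with two copies, discs minus one for raid 5).

```shell
#!/bin/sh
# stride       = raid chunk size / filesystem block size
# stripe-width = stride * number of data-bearing discs
# $1 = chunk size in KiB, $2 = block size in KiB, $3 = data-bearing discs
ext4_raid_opts() (
    stride=$(( $1 / $2 ))
    echo "stride=${stride},stripe-width=$(( stride * $3 ))"
)

ext4_raid_opts 128 4 2   # raid 10 over 4 discs
ext4_raid_opts 128 4 3   # raid 5 over 4 discs
```

The printed string has the same form as the value passed to -E in the mkfs.ext4 call earlier in the article.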

By the way, there’s a script/calculator which can calculate that for you, and I’ve seen that mkfs.ext4 sometimes does this automatically already: when I set up ext4 on raid 10 it sets stride and stripe width, when I do the same with raid 5 it doesn’t.. Anyway.

This is the most basic tuning you should give your system. Other optimizations which are not raid related would be things like mounting the filesystem with noatime and raising the commit time of the journal to 120 or something similar. (Doing so means you can lose around 120 seconds of data in case of a crash. On the other hand, the higher the value, the less busy your raid gets and the more your discs can sleep, which is energy efficient.) Anyway, this is more „filesystem“ tuning than raid tuning, so… 🙂

Raising the discs’ read speed

When tuning your raid you want to make sure that your discs run at the highest possible speed, and there are a few things you can tune for that.

Let’s take a look at hdparm: run hdparm -tT /dev/md0 5 times and calculate the average of the values. The result here is:

Timing cached reads: 4099 MB in 2.00 seconds = 2050.15 MB/sec
Timing buffered disk reads: 549 MB in 3.00 seconds = 182.54 MB/sec

Now do the same for hdparm -tT --direct /dev/md0. My results are:

Timing O_DIRECT cached reads: 1202 MB in 2.00 seconds = 600.95 MB/sec
Timing O_DIRECT disk reads: 562 MB in 3.00 seconds = 187.11 MB/sec

Doesn’t look too bad I guess (for a software raid 10). But what if we could easily get a few MB/s more? Again 5 averaged results each:

Timing cached reads: 4210 MB in 2.0 seconds = 2105.01 MB/sec
Timing buffered disk reads: 592 MB in 3.01 seconds = 196.56 MB/sec

It doesn’t do much for --direct, thus no results for that (that’s because --direct tries to avoid caching). However, as you can see we’ve got an improvement of roughly 3% in the timing cached reads and roughly 8% in the buffered disk reads. What did I do to get there?

blockdev --setra 4096 /dev/sda
(the same for sdb, sdc, sdd, sde)
blockdev --setra 8192 /dev/md0

Whether those settings are useful for you (especially the first one), and whether you’d rather use 4096 instead of 8192 for md0 as well, is something you have to check yourself 🙂
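To save some typing for „the same for sdb, sdc, ...“, here is a sketch that only prints the blockdev calls so you can review them before running anything. Both helper names are made up. The conversion relies on blockdev --setra counting in 512-byte sectors.

```shell
#!/bin/sh
# Print (not run) one "blockdev --setra" call per member disc.
# $1 = readahead in 512-byte sectors, remaining arguments = disc names
setra_cmds() (
    sectors=$1
    shift
    for disc in "$@"; do
        echo "blockdev --setra $sectors /dev/$disc"
    done
)

# Convert a --setra value (512-byte sectors) to KiB.
sectors_to_kib() (
    echo $(( $1 * 512 / 1024 ))
)

setra_cmds 4096 sda sdb sdc sdd sde
sectors_to_kib 4096
sectors_to_kib 8192
```

So 4096 means 2 MiB of readahead per disc and 8192 means 4 MiB on md0. blockdev --getra /dev/md0 shows the current value. The setting is gone after a reboot, so it would have to be reapplied from a boot script.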

Raising write speed

A little note first: this part is about stripe_cache_size. This setting seems to be available only with software raid 5, so if you’ve got a software raid 10 you can’t tune it (at least I haven’t found out how).

To optimize the write speed, we need to know what the write speed currently is. So let’s check with 3 dd runs and average afterwards. For the best results you’d run the dd test directly on the md0 device, but be aware: BY DOING SO YOU’RE GOING TO LOSE ALL DATA ON THAT DEVICE. Thus I’m doing it on the filesystem (mount /dev/md0, create a folder on it for testing and run dd there as explained in the next steps).

  • make the test file about twice as big as your RAM, so if you’ve got 4 GB of RAM, write an 8 GB file
  • play around with different bs values: start with 1024, then go to 4096, and just for fun use 8192
  • when you change the bs value you have to change count as well: while a count of 8000000 creates an 8 GB file with a bs of 1024, a bs of 2048 needs a count of 4000000, and 4096 needs 2000000
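The bs/count pairs follow from count = file size / bs. A small sketch that prints the matching dd command line for each block size without running it. dd_cmd is a hypothetical helper. The 8192000000 bytes are what 8000000 blocks of 1024 bytes add up to, and /mnt/test/testfile is the test file on the mounted raid.

```shell
#!/bin/sh
# Print the dd call that writes a file of a given size with a given bs.
# $1 = total size in bytes, $2 = block size in bytes, $3 = target file
dd_cmd() (
    echo "dd if=/dev/zero of=$3 bs=$2 count=$(( $1 / $2 ))"
)

for bs in 1024 2048 4096 8192; do
    dd_cmd 8192000000 "$bs" /mnt/test/testfile
done
```

Each printed line can be copied into a root shell once the test folder exists.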

Here are the results of my raid 10:

Test 1 – bs 1024 (count: 8000000)

Averaged result: (161+160+161)/3 = 160.66 MB/s
(51.028+51.1632+50.7968)/3 = 50.99 seconds

Test 2 – bs 2048 (count: 4000000)

Averaged result: (159+157+156)/3 = 157.33 MB/s
(51.5316+52.2847+52.4908)/3 = 52.10 seconds

Test 3 – bs 4096 (count: 2000000)

Averaged result: (157+156+155)/3 = 156 MB/s
(52.2935+52.3563+52.8737)/3 = 52.50 seconds

Test 4 – bs 8192 (count: 1000000)

Averaged result: (153+151+151)/3 = 151.66 MB/s
(53.5324+54.1206+54.1253)/3 = 53.92 seconds

Now let’s average the write bandwidth of the four tests: (160.66+157.33+156+151.66)/4 = 156.41 MB/s. So we know we can write to the raid array at around 156 MB per second.
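If you don’t want to do the averaging by hand, a few lines of shell and awk will do. This is just a sketch and avg is a made-up helper name. Note that it rounds to two decimals, while some of the per-test values above were cut off instead (160.66 rather than 160.67), so the last digit can differ.

```shell
#!/bin/sh
# Print the arithmetic mean of all arguments, rounded to two decimals.
avg() (
    printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR) printf "%.2f\n", sum / NR }'
)

avg 160.66 157.33 156 151.66
avg 159 157 156
```

The first call reproduces the 156.41 MB/s from above, the second one the 157.33 MB/s of test 2.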

However, let’s look at the raid 5:

Test 1 – bs 1024 (count: 8000000)

57 seconds
143 MB/s

Test 2 – bs 2048 (count: 4000000)

55 seconds
147 MB/s

Test 3 – bs 4096 (count: 2000000)

56 seconds
144 MB/s

Test 4 – bs 8192 (count: 1000000)


The averaged raid write speed is thus: (143+147+144+)/4 = 144 MB/s. The interesting thing, contrary to the raid 10 result, is that here the speed doesn’t drop when raising the bs size from 1024 to 8192.

Enough benchmarking and playing around with numbers, right? Let’s take a look at this:

root@localhost ~ # cat /sys/block/md0/md/stripe_cache_active 
root@localhost ~ # cat /sys/block/md0/md/stripe_cache_size   

Let’s raise it to some reasonable value, let’s say 8192 (echo 8192 > /sys/block/md0/md/stripe_cache_size as root), and run the benchmark again 🙂 While it runs, take a look at:

wdp@localhost ~ $ cat /sys/block/md0/md/stripe_cache_active 
wdp@localhost ~ $ cat /sys/block/md0/md/stripe_cache_active 

and notice in free -m that a lot of your RAM is in use. And then be impressed by:

root@localhost ~ # dd if=/dev/zero of=/mnt/test/testfile bs=1024 count=8000000
8000000+0 records in
8000000+0 records out
8192000000 bytes (8.2 GB) copied, 36.4821 s, 225 MB/s
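The RAM usage is no surprise: as far as the kernel’s md documentation describes it, the stripe cache takes page size × number of discs × stripe_cache_size bytes. A quick sketch of that calculation follows. stripe_cache_mib is a hypothetical helper, and the 4096-byte page size is an assumption (getconf PAGESIZE tells you the real one).

```shell
#!/bin/sh
# Estimate stripe cache memory in MiB:
#   page size * number of discs * stripe_cache_size
# $1 = stripe_cache_size, $2 = number of discs, $3 = page size in bytes
#      (optional, assumed to be 4096 if not given)
stripe_cache_mib() (
    echo $(( $1 * $2 * ${3:-4096} / 1024 / 1024 ))
)

stripe_cache_mib 8192 4
stripe_cache_mib 256 4
```

With 4 discs, a value of 8192 comes out at 128 MiB, compared to 4 MiB for 256, which the md documentation lists as the default.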

Let’s run the same tests as before to verify:

Test 1 – bs 1024 (count: 8000000)

36 seconds
224 MB/s

Test 2 – bs 2048 (count: 4000000)

37 seconds
220 MB/s

Test 3 – bs 4096 (count: 2000000)

37 seconds
217 MB/s

Test 4 – bs 8192 (count: 1000000)

37 seconds
218 MB/s

Averaged write speed: (224+220+217+218)/4 = 219.75 MB/s

From ~144 MB/s to ~220 MB/s with just one little adjusted setting.. Any questions? Oh, and did you notice that the „tuned“ software raid 5 got a higher bandwidth than the untuned software raid 10? And yes, obviously I’ve got too much time on weekends.
