Friday, 4 January 2013

Tweaking


After playing around with the individual drive and array defaults I found that the performance can be improved substantially with a few easy tweaks.

Useful resources

I have looked long and hard for formulas that can be used to obtain results that make sense, but ultimately it comes down to testing each value and running a benchmark in order to 'measure' the difference it makes in your environment.

A few settings that can be adjusted are listed below. I list them in the order in which I would apply them during benchmarking, starting with the settings with the biggest impact and ending with the settings with a smaller impact according to my findings.

Settings applied to the mdadm RAID array with defaults on my system (Ubuntu 12.04):

Summary of mdadm RAID Array settings
Command to apply setting | Default value | Tweaked value | Description
blockdev --setra 20480 /dev/md126 | 8192 | 102400 | Read-ahead
echo 5120 > /sys/block/md126/md/stripe_cache_size | 256 | 5120 | Stripe-cache size
echo 100000 > /sys/block/md126/md/speed_limit_max | ? | 100 000 | Max speed

It is important to keep in mind that the stripe_cache_size will use a portion of RAM. For example, an mdadm RAID array such as mine with a stripe_cache_size of 32768 will use:
stripe_cache_size * block size * number of disks
= 32768 * 4k * 4 (active disks)
= 512MB of RAM
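
As a rough sketch, the arithmetic above can be wrapped in a small shell function so that different stripe_cache_size values can be compared quickly. The function name stripe_cache_mib is made up for this illustration, and the 4k block size and 4 active disks are simply the figures from the example above.

```shell
# Estimate the RAM used by the md stripe cache, in MiB.
# Arguments: stripe_cache_size (entries), block size in KiB, number of disks.
# (stripe_cache_mib is a hypothetical helper name, not part of mdadm.)
stripe_cache_mib() {
    entries=$1
    block_kib=$2
    disks=$3
    echo $(( entries * block_kib * disks / 1024 ))
}

stripe_cache_mib 32768 4 4   # the example above: prints 512
stripe_cache_mib 5120 4 4    # the tweaked value from the table: prints 80
```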

In my case I have 4GB of RAM and the functions performed on the machine are pretty basic so it is of little concern.

Settings applied to each drive with defaults on my system (Ubuntu 12.04)

Summary of per-drive settings
Command to apply setting | Default value | Tweaked value | Description
blockdev --setra 20480 /dev/md126 | 8192 | 102400 | Read-ahead
echo 1 > /sys/block/sdX/queue/queue_depth | 31 | 1 | NCQ Queue Depth
echo 64 > /sys/block/sdX/queue/nr_requests | 128 | 64 | Nr of requests
echo deadline > /sys/block/sdX/queue/scheduler | noop deadline [cfq] | deadline | Scheduler
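
Applying the per-drive values by hand to every disk gets tedious, so here is a minimal sketch of a helper that does it for one drive at a time. The function names set_if_present and tune_drive are hypothetical. The sketch assumes the NCQ depth is exposed as device/queue_depth under the drive's sysfs directory, which is where I would expect it on most kernels; adjust the path if your system differs. Files that do not exist are skipped rather than created.

```shell
# Write a value to a sysfs-style file only if it exists and is writable.
set_if_present() {
    file=$1
    value=$2
    if [ -w "$file" ]; then
        echo "$value" > "$file"
    else
        echo "skipped: $file" >&2
    fi
}

# Apply the tweaked per-drive values to one drive.
# Argument: the drive's sysfs directory, e.g. /sys/block/sdb
tune_drive() {
    sys=$1
    set_if_present "$sys/device/queue_depth" 1      # NCQ queue depth (assumed path)
    set_if_present "$sys/queue/nr_requests" 64      # number of requests
    set_if_present "$sys/queue/scheduler" deadline  # I/O scheduler
}

# Example (run as root; the drive letters are placeholders):
# for d in /sys/block/sd[b-f]; do tune_drive "$d"; done
```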

After hours of testing and a massive spreadsheet I have values that provide substantial performance gains. Here are some benchmark tests.



Before I apply the values persistently I will reset the values by restarting the machine.




The key benchmark with the iozone test is Stride Read, so we compare that now: 2657627 before vs 2818608 after. In the dd test: 150 MB/s before and 236 MB/s after.

Let's look at the bonnie output. I will spend the most energy on this as I think this is the most informative benchmark:
As expected, we see that the Sequential block output is similar to the dd output. Doing the dd tests has actually been redundant as the results are also contained in the bonnie output, but for the sake of thoroughness I did both tests.
Sequential block input, or read, isn't much improved by the tweaking. This is not ideal, although it is unfortunately the reality. Reads and writes are a balancing act, as a read performance improvement will mostly cause a reduction in write performance.
Sequential block rewrite is reading data and then writing it, so it is essentially the reading and writing performance combined. In this case, 103900 with defaults and 136714 with the tweaks in place.
Random seeks are how many random blocks bonnie can read, in this case 519 vs 404.
The +++++ means that the measurement is so fast that the error margin is a sizeable percentage of the measurement and the result is therefore inaccurate.



Here is a discussion on a script that tweaks the mdadm RAID array automatically. I used this as a reference although I found some of the settings mentioned here not to make a great difference.
http://ubuntuforums.org/showthread.php?t=1916607

I found that the best way to tweak was to choose some baseline value based on manual tweaking and testing and from there run through a number of values on one setting and compare them to each other. I used the following script to save some time:
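
A minimal sketch of such a sweep (an illustration, not the original script) could look like this. The function name sweep, the marker format and the /mnt/raid mount point in the usage comment are all made up for the example. It writes each candidate value to one settings file, prints a marker line so that the runs can be told apart in the log, and then runs whatever benchmark command it is given.

```shell
# Run a benchmark once for every candidate value of a single setting.
# Arguments: settings file, benchmark command (as one string), values...
sweep() {
    setting=$1
    bench=$2
    shift 2
    for value in "$@"; do
        # Apply the candidate value; stop if the write fails.
        echo "$value" > "$setting" || return 1
        # Marker line so each run can be found in the log afterwards.
        echo "### $(basename "$setting")=$value"
        sh -c "$bench"
    done
}

# Example (run as root; /mnt/raid is a placeholder mount point):
# sweep /sys/block/md126/md/stripe_cache_size \
#       "bonnie++ -d /mnt/raid -s 8192 -u root -q" 256 1024 4096 8192
```

Changing only one setting per sweep keeps the comparison honest, since several of these settings influence each other.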



In order to analyse the output I used a few simple greps like the ones below. I used screen to run the benchmark and logged all the screen output with the -L option.



Importing this into Excel and using conditional formatting makes digging through the numbers easier. It is clear from the numbers below that there will not be a one-size-fits-all solution. The Sequential input and output is an example of settings that play off against each other.

Another interesting observation is the large impact the /sys/md126/md/queue/scheduler setting has on the sequential block input and output.

Also, it is useful to note that more cache isn't always better.


My choice has been made and I will use this script below to configure it after reboot.



In the next post I plan to implement these values persistently.


Tuesday, 25 December 2012

Performance before tweaking

In order to measure performance we have to know how to benchmark and understand what we are measuring. I came across this article that explains the benchmarking portion.

Benchmarking explained




bonnie++ output before any tweaking is done. I use the 8GB size during the tests because the RAM size on the machine is 4GB; a test file twice the size of RAM keeps caching from inflating the results.
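
The "twice the RAM" sizing can be worked out from /proc/meminfo instead of by hand. The sketch below is only an illustration: bonnie_size_mib is a hypothetical helper name and /mnt/raid in the usage comment is a placeholder mount point.

```shell
# Return a bonnie++ test size in MiB that is twice the given RAM size.
# Argument: RAM size in kB, as reported on the MemTotal line of /proc/meminfo.
bonnie_size_mib() {
    ram_kb=$1
    echo $(( ram_kb * 2 / 1024 ))
}

bonnie_size_mib 4194304   # exactly 4GB of RAM: prints 8192

# Example (placeholder mount point, run on the machine being tested):
# ram_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
# bonnie++ -d /mnt/raid -s "$(bonnie_size_mib "$ram_kb")" -u root
```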



Whoaa!!! ~50MB/s average transfer. Great improvement over the previous ~10MB/s. This is not a good test, though, because the transfer rate is dependent on the source drive speed, which in my case is the slowest drive (/dev/sdf if you have been following).


dd gives us write speed but uses buffering
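
One common way to take most of the buffering out of the picture is to ask GNU dd to flush the data to disk before it reports the rate. The sketch below assumes GNU dd (for conv=fdatasync); dd_write_test is a made-up helper name and the target path in the usage comment is a placeholder.

```shell
# Write COUNT MiB of zeros to TARGET and flush to disk before dd reports
# its throughput, so the page cache does not flatter the number.
# Arguments: target file, size in MiB.
dd_write_test() {
    target=$1
    count_mib=$2
    dd if=/dev/zero of="$target" bs=1M count="$count_mib" conv=fdatasync
}

# Example (placeholder path on the array; writes 8GB):
# dd_write_test /mnt/raid/ddtest.bin 8192
```

Because the flush is included in the elapsed time, the reported rate is usually lower than with a plain dd, but closer to what the disks can sustain.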


and finally hdparm

Sunday, 23 December 2012

mdadm RAID creation

Here comes the fun part!! Let's create the RAID...

A good read on partition alignment.

chunk size = stripe size = minimum size that can be written to the disk

I will use parted for the partitioning and alignment. Do the following for each drive.
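
For illustration only, the per-drive steps could look something like the commands printed by the sketch below. This is not the original command listing: the GPT label, the 1MiB start offset and the function name parted_commands are assumptions made for the example. The function only prints the commands so they can be reviewed before anything touches a disk.

```shell
# Print (do not run) a plausible parted sequence for one drive.
# Argument: the whole-disk device, e.g. /dev/sdb
parted_commands() {
    dev=$1
    echo "parted -s -a optimal $dev mklabel gpt"                # assumed label type
    echo "parted -s -a optimal $dev mkpart primary 1MiB 100%"   # 1MiB-aligned start
    echo "parted -s $dev set 1 raid on"                         # flag partition for RAID
    echo "parted -s $dev align-check optimal 1"                 # verify the alignment
}

parted_commands /dev/sdX
```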



And here we go:


The RAID array starts in degraded mode with a spare drive. The spare drive is added to the array at the end and the array can then perform at full speed.
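
As a hedged sketch of what the creation step can look like, the helper below builds an mdadm command line from a chunk size and a list of partitions. RAID level 5 is an assumption based on the single parity drive mentioned in this post, and the device names and the function name mdadm_create_cmd are placeholders. The command is only printed, not executed.

```shell
# Print (do not run) an mdadm create command.
# Arguments: md device, chunk size in KiB, member partitions...
# RAID level 5 is assumed here.
mdadm_create_cmd() {
    md=$1
    chunk_kib=$2
    shift 2
    echo "mdadm --create $md --level=5 --chunk=$chunk_kib --raid-devices=$# $*"
}

mdadm_create_cmd /dev/md0 512 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1

# The initial build can be followed with: cat /proc/mdstat
```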



and then we create the new partition on the newly created device

'stride' is the amount of data written to each disk and 'stripe width' is the stride length * number of active drives. The number of active drives is 4 in my case, as the fifth drive is a parity drive. There are many calculators available, just google for one.
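
The calculation itself is small enough to do in the shell. The sketch below assumes the 512KB chunk size and 4KB block size used elsewhere in this series, with stride expressed in file system blocks. The function names are made up for the example, and the mkfs line in the comment is only an illustration; check man mke2fs for the exact spelling of the extended options on your release.

```shell
# stride = chunk size / file system block size (both in KiB)
ext4_stride() {
    chunk_kib=$1
    block_kib=$2
    echo $(( chunk_kib / block_kib ))
}

# stripe width = stride * number of active (data) drives
ext4_stripe_width() {
    stride=$1
    data_disks=$2
    echo $(( stride * data_disks ))
}

stride=$(ext4_stride 512 4)
width=$(ext4_stripe_width "$stride" 4)
echo "stride=$stride stripe width=$width"   # prints: stride=128 stripe width=512

# Illustration only (device name as used in this series):
# mkfs.ext4 -b 4096 -E stride=128,stripe_width=512 /dev/md126
```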


Remember to add the new device in fstab if you would like it to be mounted automatically.



add this line to /etc/fstab in my case:

Friday, 21 December 2012

That old chestnut: ALIGNMENT

Misalignment of partitions can cause severe performance issues. It is important to make sure that the alignment is correct before even touching mdadm. Remember, it can't be changed afterwards.

Here is a good source that explains alignment with modern drives.
Linux on 4KB-sector disks: Practical advice

Another good explanation can be found here:
disk-performance-part-2-raid-layouts-and-stripe-sizing

Here is an example of a misaligned partition. The partition starts at sector 63.




Using parted we can get confirmation of the issue:



We are looking to align on multiples of the stripe size. Here we select a stripe size of 512KB, or in mdadm terms a chunk size of 512KB.

Note: This is not to be confused with stripe width, which is derived from the stripe size, the number of load bearing disks, and the block size on the disk (4KB). We will set the stripe width when we create the partition.
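
The alignment check is simple arithmetic: the partition's starting offset in bytes must divide evenly by the alignment boundary. Here is a minimal sketch, assuming 512-byte logical sectors unless told otherwise; is_aligned is a hypothetical helper name.

```shell
# Succeed if a partition's start is a multiple of the alignment boundary.
# Arguments: start sector, alignment in KiB, sector size in bytes (default 512).
is_aligned() {
    start_sector=$1
    align_kib=$2
    sector_bytes=${3:-512}
    [ $(( start_sector * sector_bytes % (align_kib * 1024) )) -eq 0 ]
}

# Sector 63 (the old DOS default) against a 512KB chunk:
is_aligned 63 512 && echo aligned || echo misaligned     # prints misaligned
# Sector 2048 (a 1MiB offset) against a 512KB chunk:
is_aligned 2048 512 && echo aligned || echo misaligned   # prints aligned
```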

Wednesday, 19 December 2012

hdparm settings

hdparm is used to get and set drive parameters.



hdparm -tT on all the drives gives the following output



From the previous test we recall that sdd and sdi were the bottlenecks, but in these tests they run just fine, which brings me to the conclusion that the hdparm settings have little bearing on performance. This is because UDMA is being used.

After hooooooours of tinkering and tweaking I saw little consistent improvement, so I will spend no more time on it :) The default settings are pretty good on the modern drives.



Tuesday, 18 December 2012

Find the bottleneck with iostat

Current performance before I start rebuilding:

While running bonnie++ I run iostat in an attempt to pinpoint the bottleneck:


From this it is quite clear that sdd and sdi are reaching 100% utilisation and in this case the 'await' time is quite high on the same two drives. These drives are thus the weak links in the RAID and the other drives will have to wait for the slow ones to catch up. A RAID array is only as fast as its slowest drive.
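
Picking the saturated drives out of a long iostat run by eye gets tiring, so a small filter can help. The sketch below assumes the extended iostat report (-x), where %util is the last column of each device line; busy_devices is a made-up helper name and the threshold is up to you.

```shell
# Read "iostat -dx" output on stdin and print the devices whose %util
# (assumed to be the last column) is at or above the given threshold.
busy_devices() {
    threshold=$1
    awk -v t="$threshold" '
        # Device lines start with a lowercase name and end in a number;
        # headers and the banner line are skipped by these two checks.
        $1 ~ /^[a-z]/ && $NF ~ /^[0-9]+[.,]?[0-9]*$/ && $NF + 0 >= t { print $1 }
    '
}

# Example: sample every 5 seconds while the benchmark runs, list drives >= 90% busy
# iostat -dx 5 | busy_devices 90
```

A drive that shows up in this list for most of the run while the others do not is a good candidate for the weak link.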

Here is the bonnie++ output



And for reference, the same performance measures for the 500GB WD Green OS drive:

iostat



As we would expect, the utilisation is pushed to 100% or close to that for the entire test.

bonnie++



The throughput is comparable but there is a massive difference between the latency for the RAID Array and the OS drive.


Another test is to copy a large file to the array and check iostat again to confirm consistency with the bonnie++ test:



During the rsync, the transfer rate fluctuates greatly. The range is from ~5MB/s to ~40MB/s and is therefore very inconsistent and just downright slow:



One last test with dd:



and again the results from the dd test are appalling: