Parity-based redundancy (RAID5/6/triple parity and beyond) on BTRFS and MDADM (Dec 2014)

While reading the following posts (second post) from Adam Leventhal, I wondered what exactly the status of parity-based redundancy in general, and triple parity in particular, is on Linux in MDADM and BTRFS.

TL;DR: There are patches to extend the Linux kernel to support up to six parity disks, but BTRFS does not want them because they do not fit its “business case”, and MDADM would want them but somebody needs to develop patches for the MDADM component. The kernel RAID implementation is ready and usable. If someone volunteers to do this kind of work, I would support it with equipment and with myself as a test resource.

BTRFS has some preliminary support for RAID5/6 but it is completely unusable in its current state. Some highlights:

  • As of kernel version 3.14, it works if everything goes right (that is: no errors like failing disks or corruptions), but the error handling is still lacking.
  • scrub cannot fix issues with raid5/6 yet. This means that if you have any checksum problem, your filesystem will be in a bad state.
  • btrfs does not yet seem to recognise that if you remove a drive from an array and later plug it back in, the drive is then out of date.
  • btrfs does not handle a drive that is present but not working very well: for example, an attempt to remove a faulty drive from the array (btrfs device delete) fails because it causes or requires reading from the faulty drive itself, and thus btrfs will continue to attempt to access the faulty drive forever.
  • … plus some minor but not too critical things

So much for the status of parity-based redundancy in BTRFS in general – it is evidently still in its infancy and needs to mature, as it is unusable in its current state. (As of December 2014 I've seen some patches appearing on the mailing list intended to fix some of these problems, but they are not yet in the code.)

During my research I also found some patches from Andrea Mazzoleni that would extend the kernel's ability to have up to six parity disks – per file system. Patches to add this functionality to BTRFS were also supplied (see here for the patches).
For use cases where six parity disks are not needed you can of course also pick three or four parity disks – it's up to you to decide. This would enable BTRFS to create rather large local file systems with very good resiliency against data loss.
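To make concrete what additional parity disks buy you, here is a minimal sketch of the classic two-parity (P+Q) scheme in the style of RAID6: P is the XOR of all data blocks and Q is a weighted sum in GF(2^8), which together allow any two lost data blocks to be rebuilt. It is an illustration only and does not attempt to reproduce the construction the patches use for up to six parities. The function names are made up for this example, and the field polynomial 0x11d with generator 2 is an assumption of this sketch.

```python
# Illustrative two-parity (P+Q) sketch over GF(2^8); NOT the code from the patches.
# All names here are hypothetical and exist only for this example.

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8+x^4+x^3+x^2+1 (0x11d)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result

def gf_pow(a, n):
    """Raise a to the n-th power by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result

def gf_inv(a):
    """Inverse via a^254, since a^255 == 1 for every non-zero a."""
    return gf_pow(a, 254)

def encode(data):
    """Return (P, Q) for a list of equal-length data blocks."""
    size = len(data[0])
    p, q = bytearray(size), bytearray(size)
    for i, block in enumerate(data):
        coef = gf_pow(2, i)  # weight of disk i in the Q parity
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coef, byte)
    return bytes(p), bytes(q)

def recover_two(data, x, y, p, q):
    """Rebuild data blocks x and y (given as None in data) from P and Q."""
    a, b = bytearray(p), bytearray(q)
    for i, block in enumerate(data):
        if block is None:
            continue
        coef = gf_pow(2, i)
        for j, byte in enumerate(block):
            a[j] ^= byte                 # ends up as Dx ^ Dy
            b[j] ^= gf_mul(coef, byte)   # ends up as g^x*Dx ^ g^y*Dy
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    inv = gf_inv(gx ^ gy)
    dx = bytes(gf_mul(inv, b[j] ^ gf_mul(gy, a[j])) for j in range(len(a)))
    dy = bytes(a[j] ^ dx[j] for j in range(len(a)))
    return dx, dy

blocks = [b"disk-0..", b"disk-1..", b"disk-2..", b"disk-3.."]
p, q = encode(blocks)
damaged = [blocks[0], None, blocks[2], None]   # disks 1 and 3 are gone
d1, d3 = recover_two(damaged, 1, 3, p, q)
assert (d1, d3) == (blocks[1], blocks[3])
```

With only P and Q, a third simultaneous failure is unrecoverable; each further parity disk adds one more independent equation and thus one more tolerated failure.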

I have tested the patches and they work just fine. So I wondered why they were not added, as they would provide a unique feature to BTRFS that no other operating system or file system (not even ZFS) has.

My question on the mailing list and the answer can be seen here:

LINK #1, LINK #2 and the answer here. Tests with the patches installed can be found here.

Still not satisfied with the answer, I mailed some people offline and got the following responses, which I'd like to show here:

Our plan is based on the upper distributed fs or storage, which can provide
higher reliability than local filesystems, so we think RAID1/RAID10/RAID5/RAID6
is enough for us.

Your work is very very good, it just doesn’t fit our business case.

And (from a different person):

It would be there some day in the long run I guess. But for now
bringing in the support for enterprise use cases and stability
to the btrfs has been our focus.

Of course I do agree with the point that stability matters most, given the ongoing problems and bugs with BTRFS that can easily cause the loss of a whole file system (e.g. the most recent problems with snapshots in the 3.16 and 3.17 kernels, system lockups, poor performance, space problems and so on). But if a file system also includes the functionality of a volume manager, then it needs to provide at least some basic features such as:

  • parity-based redundancy
  • hot-spare disks or n-way-mirroring (see below)
  • mirroring
  • striping
  • n-way-mirroring if there is no support for hot-spare disks (that means more than one copy of the data, to protect against the failure of more than one disk in a mirrored configuration)

Currently the *only* working implementation that BTRFS provides for resiliency against disk failures is mirroring. Ultimately this means RAID1 for a file system with exactly two disks or RAID1+0 for a file system with more than two disks. A failure of more than one disk is fatal in every case. Hot-spares? Not available and nobody is working on it. N-Way-Mirroring? Not available and nobody has claimed it. RAID5/6? You guessed it… not available/usable at the moment. No mirroring at all? That's not what I'd use BTRFS for, as the whole idea of a checksum-based file system is to detect and repair corruptions, which is impossible in a scenario with no redundancy.
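The last point can be shown with a small sketch: a checksum only tells you that a block is bad, and you need a second copy to actually repair it. This is a simplified model, not how BTRFS is implemented. The function name read_block is hypothetical, and zlib.crc32 is used merely as a stand-in for whatever checksum the file system really uses.

```python
import zlib

def read_block(copies, expected_crc):
    """Return verified data, repairing bad copies from a good one in place.

    copies is a list with one entry per mirror (hypothetical model).
    """
    good = next((c for c in copies if zlib.crc32(c) == expected_crc), None)
    if good is None:
        # The checksum detected the problem, but there is nothing to repair from.
        raise IOError("corruption detected but no intact copy to repair from")
    for i, c in enumerate(copies):
        if zlib.crc32(c) != expected_crc:
            copies[i] = good  # overwrite the corrupted mirror copy
    return good

original = b"important data"
crc = zlib.crc32(original)

# Two copies, one of them silently corrupted: detected and repaired.
mirrored = [b"important dat\x00", original]
assert read_block(mirrored, crc) == original
assert mirrored[0] == original

# A single corrupted copy: detected, but the data is gone.
single = [b"important dat\x00"]
try:
    read_block(single, crc)
except IOError as err:
    print("single copy:", err)
```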

The claimed ongoing feature development for BTRFS can be found here. As you see – nobody is working on hot-spare support, n-way-mirroring or RAID5/RAID6.

So returning to the answer I got offline, I'd like to ask what the overall goal of BTRFS is and for whom it is developed – the so-called “(our) business case” in the cited text. To me it seems the focus is not to be a general-purpose Linux file system; instead it seems to be more a file system for large enterprises where data redundancy is already handled at lower (= storage) levels. Support for large local file systems, as ZFS offers, is not available and (according to my impression) not desired.

Now if you read through all of this and you want to comment like

“… then develop and post patches – they'd be highly welcome”

then I tell you: They are obviously not. At least on BTRFS.

The patches posted were working fine and can easily be adapted for the upstream kernel. Given the immature state of the RAID5/6 support in BTRFS, it's the ideal time to add them – before a lot of work is put into the RAID5/6 code. So why hesitate?

Neil Brown, who maintains the kernel functions for the RAID code, would send the patches upstream immediately – if someone (that means either MDADM itself or BTRFS) would use them. MDADM patches are not ready (yet?) and BTRFS doesn't want them, it seems.

So yeah.. so much for BTRFS and parity-based redundancy.

For MDADM it looks a bit better. The effort needed to change the code in MDADM is non-trivial, and currently Andrea Mazzoleni, who developed the patches, does not have any time to work on an MDADM patch. I cannot do it myself as I am a pretty bad software developer… If you think you can do it, I'd be happy to help you as a tester or with equipment so that at least MDADM gets a much better parity implementation. Please contact me if you are willing to do this.

So which options are left for large local file systems on Linux?

You could either use ZFS on Linux, which supports up to three parity disks per VDEV (a number of disks are grouped into a VDEV, onto which the chosen parity is applied), and then scale by adding more VDEVs – or you could use several smaller RAID6 arrays (I would not recommend RAID5 due to the high risk of losing the array if a disk fails and you need to rebuild) and group them together with LVM on top. Leaving the licensing discussion aside, ZFS on Linux works quite well but requires a lot of manual work to get it up and running. Once it runs it is stable – just make sure you use the latest ZFS code. On top of that, ZFS is by far more mature than BTRFS while offering all of its features and more.
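To compare these options, here is a rough sketch that computes usable capacity and the guaranteed number of tolerated disk failures for a few layouts. The disk counts and the Layout class are assumptions chosen purely for illustration, and the model ignores real-world factors such as rebuild times or metadata overhead. The key assumption is that the groups are combined into one pool or volume, so losing a single group loses everything; the guaranteed tolerance is therefore the parity count of one group.

```python
from dataclasses import dataclass

@dataclass
class Layout:
    """Hypothetical model: identical groups combined into one pool/volume."""
    name: str
    groups: int
    disks_per_group: int
    parity_per_group: int

    def total_disks(self):
        return self.groups * self.disks_per_group

    def data_disks(self):
        return self.groups * (self.disks_per_group - self.parity_per_group)

    def efficiency(self):
        return self.data_disks() / self.total_disks()

    def guaranteed_failures(self):
        # Worst case: all failures hit the same group.
        return self.parity_per_group

# Example numbers only; none of them are recommendations.
layouts = [
    Layout("ZFS, 3 x triple-parity VDEV of 8 disks", 3, 8, 3),
    Layout("LVM over 4 x RAID6 of 6 disks", 4, 6, 2),
    Layout("single group, 20 data + 5 parity disks", 1, 25, 5),
]

for l in layouts:
    print(f"{l.name}: {l.total_disks()} disks, "
          f"{l.data_disks()} for data ({l.efficiency():.0%}), "
          f"survives any {l.guaranteed_failures()} failures")
```

In this model the single large group with five parity disks is both the most space-efficient and the most failure-tolerant of the three, which is exactly the kind of layout that needs more than two or three parity disks.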

4 thoughts on “Parity-based redundancy (RAID5/6/triple parity and beyond) on BTRFS and MDADM (Dec 2014)”

  1. Well, a pull request was just granted by Linus for the missing BTRFS 5/6 functionality (rebuilding and scrubbing).

    Patience, padawan!

    Also, in what universe does ZFS offer all of the features of BTRFS?

    Example: Can ZFS grow an array from 7 disks to 8? Hint: the answer rhymes with “No.”

    1. Yes, RAID5/6 is coming, but the key aspect here is that BTRFS could have had a parity implementation that supports more than just one or two parity disks. For large file systems (or pools) this really becomes a problem… I would not want to run a file system with 20+ disks on BTRFS with RAID6 alone. With a lot of disks you would need to create a RAID5 or RAID6 on a lower level (e.g. with MDADM or with a hardware RAID controller) and then put these arrays into a btrfs file system.

      With the offered implementation you could easily create a file system with 20 data disks and 5 parity disks. This becomes more and more important for really large file systems which are not uncommon nowadays.

      And yes… ZFS lacks some features. The ability to change a VDEV (or a raid group, if you want to call it that) from 7 to 8 disks is one. But you can certainly grow a ZFS file system by adding another VDEV. For example, if you have one pool with one VDEV consisting of 7 disks, you can add a second VDEV containing another 7 disks. What you can't do (at least with RAIDZ) is add just one disk.

      Oh, and while we are at it: ZFS also lacks a defragmentation utility. But apart from that… ZFS is quite complete.

      But I think you really did not get my main point: before you go and add more and more features, you should take care of the supported RAID levels. Hot-spares, triple mirroring and – for big file systems – more than just RAID5/6 are among them. Another aspect was that a *working* patch was declined because of… well… good question? Because it “doesn't fit our business case”. So what would that business case be? I personally think that storage vendors like Fujitsu who are developing BTRFS surely have no interest in enabling users to completely bypass their storage products. Maybe this is one piece of the truth?

  2. Hello, thanks for this writeup. This is very interesting and also a very sad read. It seems to me that if you want anything like RAID60 or triple parity, we will be stuck on ZoL for a long time.

    1. Hi,

      yes – that's how it looks to me. Really large file systems (> 50-100 TB scale) with lots of disks are impossible with this setup – unless you already do some sort of RAID on a lower (i.e. storage) level.
