Parity-based redundancy (RAID5/6/triple parity and beyond) on BTRFS and MDADM (Dec 2014)

While reading the following posts (second post) from Adam Leventhal, I wondered what exactly the status of parity-based redundancy in general, and triple parity in particular, is on Linux in MDADM and BTRFS.

TL;DR: There are patches to extend the Linux kernel to support up to six parity disks. BTRFS does not want them because they do not fit its developers' “business case”. MDADM would want them, but somebody needs to develop patches for the MDADM component. The kernel RAID implementation is ready and usable. If someone volunteers to do this kind of work, I would support it with equipment and with myself as a test resource.

BTRFS has some preliminary support for RAID5/6, but it is completely unusable in its current state. Some highlights:

  • As of kernel version 3.14, it works if everything goes right (that is: no errors like failing disks or corruptions), but the error handling is still lacking.
  • scrub cannot fix issues with raid5/6 yet. This means that if you have any checksum problem, your filesystem will be in a bad state.
  • btrfs does not yet seem to recognise that a drive which was removed from an array and later plugged back in is out of date.
  • btrfs does not cope well with a drive that is present but not working: for example, an attempt to remove a faulty drive from the array (btrfs device delete) fails because it causes or requires reading from the faulty drive itself, and thus btrfs will continue to attempt to access the faulty drive forever.
  • … plus some minor, not too critical things

So much for the status of parity-based redundancy in BTRFS in general: it is evidently still in its infancy and needs to mature, as it is unusable in its current state. (As of December 2014 I've seen some patches appearing on the mailing list intended to fix some of these problems, but they are not yet in the code.)

During my research I also found some patches from Andrea Mazzoleni that would extend the kernel's ability to have up to six parity disks per file system. Patches to add this functionality to BTRFS were also supplied (see here for the patches).
For use cases where six parity disks are not needed, you can of course also pick three or four parity disks; it's up to you to decide. This would enable BTRFS to create rather large local file systems with very good resiliency against data loss.

I have tested the patches and they work just fine. So I wondered why they were not added, as they would give BTRFS a unique feature that no other operating system or file system (not even ZFS) has.

My question on the mailing list and the answer can be seen here:

LINK #1, LINK #2 and the answer here. Tests with the patches installed can be found here.

Still not satisfied with the answer, I mailed some people offline and got the following responses, which I'd like to show here:

Our plan is based on the upper distributed fs or storage, which can provide
higher reliability than local filesystems, so we think RAID1/RAID10/RAID5/RAID6
is enough for us.

Your work is very very good, it just doesn’t fit our business case.

And (from a different person):

It would be there some day in the long run I guess. But for now
bringing in the support for enterprise use cases and stability
to the btrfs has been our focus.

Of course I do agree that stability matters most, given the ongoing problems and bugs in BTRFS that can easily cause the loss of a whole file system (e.g. the most recent problems with snapshots in the 3.16 and 3.17 kernels, system lockups, poor performance, space problems and so on). But if a file system also includes the functionality of a volume manager, then it needs to provide at least some basic features such as:

  • parity-based redundancy
  • hot-spare disks or n-way-mirroring (see below)
  • mirroring
  • striping
  • n-way-mirroring if there is no support for hot-spare disks (that means more than one copy of the data, to protect against the failure of more than one disk in a mirrored configuration)

Currently the *only* working implementation that BTRFS provides for resiliency against disk failures is mirroring. Ultimately this means RAID1 for a file system with exactly two disks or RAID1+0 for a file system with more than two disks. A failure of more than one disk is fatal in every case. Hot-spares? Not available and nobody is working on it. N-way-mirroring? Not available and nobody has claimed it. RAID5/6? You guessed it… not available/usable at the moment. No mirroring at all? That's not what I'd use BTRFS for, as the whole idea of a checksum-based file system is to detect and repair corruptions, which is impossible in a scenario with no redundancy.

The list of claimed and ongoing feature development for BTRFS can be found here. As you can see, nobody is working on hot-spare support, n-way-mirroring or RAID5/RAID6.

So, returning to the answer I got offline, I'd like to ask what the overall goal of BTRFS is and for whom it is developed – the so-called “(our) business case” in the cited text. To me it seems the focus is not on being a general-purpose Linux file system; instead it seems to be more of a file system for large enterprises where data redundancy is already handled at lower (= storage) levels. Support for large local file systems like ZFS offers is not available and (according to my impression) not desired.

Now if you read through all of this and you want to comment like

“… then develop and post patches – they´d be highly welcomed”

then I tell you: They are obviously not. At least on BTRFS.

The patches posted were working fine and can easily be adapted for the upstream kernel. Given the immature state of the RAID5/6 support in BTRFS, it's the ideal time to add them – before a lot of work is put into the RAID5/6 code. So why hesitate?

Neil Brown, who maintains the kernel functions for the RAID code, would send the patches upstream immediately if someone (that means either MDADM itself or BTRFS) would use them. MDADM patches are not ready (yet?) and BTRFS doesn't want them, it seems.

So yeah.. so much for BTRFS and parity-based redundancy.

For MDADM it looks a bit better. The effort needed to change the code in MDADM is non-trivial, and currently Andrea Mazzoleni, who developed the patches, does not have any time to work on an MDADM patch. I cannot do it myself as I am a pretty bad software developer… If you think you can do it, I'd be happy to help you as a tester or with equipment so that at least MDADM gets a much better parity implementation. Please contact me if you are willing to do this.

So which options are left for large local file systems on Linux?

You could either use ZFS on Linux, which supports up to three parity disks per VDEV (a number of disks are grouped into a VDEV, onto which the chosen parity is applied), and then scale by adding more VDEVs. Or you could use several smaller RAID6 arrays (I would not recommend RAID5 due to the high risk of losing the array if a disk fails and you need to rebuild) and group them together with LVM on top. Leaving the licensing discussion aside, ZFS on Linux works quite well but requires a lot of manual work to get it up and running. Once it runs, it is stable – just make sure you use the latest ZFS code. On top of that, ZFS is by far more mature than BTRFS while offering all of its features and more.
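To compare the two options, here is a rough sketch of the capacity arithmetic. The disk count (24), the disk size (4 TB) and the group sizes are made-up example values, not a recommendation. The calculation ignores file system metadata and any ZFS- or LVM-specific overhead.

```python
def data_disks(total_disks: int, group_size: int, parity_per_group: int) -> int:
    """Number of disks' worth of usable data when all disks are split
    into equally sized parity groups (VDEVs or MD arrays)."""
    if total_disks % group_size != 0:
        raise ValueError("total_disks must be a multiple of group_size")
    if parity_per_group >= group_size:
        raise ValueError("a group needs at least one data disk")
    groups = total_disks // group_size
    return groups * (group_size - parity_per_group)


# Hypothetical example hardware: 24 disks of 4 TB each.
TOTAL_DISKS = 24
DISK_TB = 4

layouts = {
    "ZFS: 2 VDEVs of 12 disks, triple parity": data_disks(TOTAL_DISKS, 12, 3),
    "MD + LVM: 3 RAID6 arrays of 8 disks": data_disks(TOTAL_DISKS, 8, 2),
}

for name, disks in layouts.items():
    print(f"{name}: {disks * DISK_TB} TB usable, "
          f"{TOTAL_DISKS - disks} disks spent on parity")
```

With these example numbers both layouts spend the same number of disks on parity, but each triple-parity group survives three failed disks while each RAID6 array survives only two.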

Linux: Configuring iSCSI Multipathing

A few days ago I posted a short howto on configuring iSCSI multipathing with Nexenta. This post covers the configuration of the Linux initiator using iSCSI multipathing.

Before we start, a preliminary note: It is a very good idea (I'd call it “required”) to use separate subnets for each physical interface. Do NOT use the same subnet across different network interfaces!

If you do not comply with this simple rule, you will end up having problems with so-called ARP flux (also documented here, here, here and so on), which requires further modifications.

For configuring and using iSCSI multipathing the following packages are needed:

  • device-mapper-multipath
  • device-mapper-multipath-libs
  • iscsi-initiator-utils

Our testlab used a VM based on Oracle Enterprise Linux 6 Update 2 with two physical interfaces:

  • eth1:    192.168.1.200/24
  • eth3:    192.168.10.2/24
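The "one subnet per interface" rule can be checked before configuring anything. The following is a small sketch using Python's standard `ipaddress` module. The helper name `overlapping_subnets` is made up for this example, and the addresses are the ones from the testlab above.

```python
import ipaddress
from itertools import combinations


def overlapping_subnets(interfaces: dict) -> list:
    """Return all pairs of interface names whose subnets overlap."""
    networks = {
        name: ipaddress.ip_interface(address).network
        for name, address in interfaces.items()
    }
    return [
        (a, b)
        for a, b in combinations(networks, 2)
        if networks[a].overlaps(networks[b])
    ]


testlab = {
    "eth1": "192.168.1.200/24",
    "eth3": "192.168.10.2/24",
}

conflicts = overlapping_subnets(testlab)
if conflicts:
    print("Interfaces sharing a subnet (risk of ARP flux):", conflicts)
else:
    print("All interfaces are in separate subnets.")
```

If both interfaces were in 192.168.1.0/24, for example, the pair ("eth1", "eth3") would be reported.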


ext4 file systems and the 16 TB limit – how to *solve* it

File systems do have limits. That's no surprise. ext3 had a limit of 16 TB file system size. If you needed more space, you'd have to use another file system, for instance XFS or JFS, or split the capacity across multiple mount points.

ext4 was designed to allow far larger file systems than ext3. According to Wikipedia, ext4 has a maximum file system size of 1 EiB (roughly one exabyte, i.e. 1024 PiB or 1024*1024 TiB).

Now if you try to create one single large file system with ext4 on any Linux distribution out there (including OEL 6.1; as of 18th August 2011), you will end up with:

[root@localhost ~]# mkfs.ext4 /dev/iscsi/test
mke4fs 1.41.9 (22-Aug-2009)
mkfs.ext4: Size of device /dev/iscsi/test too big to be expressed in 32 bits using a blocksize of 4096.
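The 16 TB figure follows directly from the two numbers in that error message: 32-bit block numbers and a block size of 4096 bytes. The following is only a sketch of the arithmetic, not the actual check done by mkfs. The function name `fits_in_32bit_blocks` is made up for this example.

```python
BLOCK_SIZE = 4096           # bytes per block, as in the error message
MAX_BLOCKS_32BIT = 2 ** 32  # number of distinct 32-bit block numbers
TIB = 2 ** 40               # bytes per TiB

# Largest size addressable with 32-bit block numbers and 4 KiB blocks.
limit_bytes = BLOCK_SIZE * MAX_BLOCKS_32BIT
print(f"Limit: {limit_bytes // TIB} TiB")


def fits_in_32bit_blocks(device_bytes: int, block_size: int = BLOCK_SIZE) -> bool:
    """True if the device's block count can be expressed in 32 bits."""
    return device_bytes // block_size < MAX_BLOCKS_32BIT


print(fits_in_32bit_blocks(10 * TIB))  # a 10 TiB device
print(fits_in_32bit_blocks(20 * TIB))  # a 20 TiB device
```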

This post is about how to solve the issue.
