Parity-based redundancy (RAID5/6/triple parity and beyond) on BTRFS and MDADM (Dec 2014)

December 10th, 2014 4 comments

While reading the following posts (second post) from Adam Leventhal, I wondered what exactly the status of parity-based redundancy in general, and triple parity in particular, is on Linux in MDADM and BTRFS.

TL;DR: There are patches to extend the Linux kernel to support up to six parity disks, but BTRFS does not want them because they do not fit its “business case”. MDADM would want them, but somebody needs to develop patches for the MDADM component. The kernel RAID implementation is ready and usable. If someone volunteers to do this kind of work, I would support it with equipment and act as a tester.

BTRFS has some preliminary support for RAID5/6, but it is completely unusable in its current state. Some highlights:

  • As of kernel version 3.14, it works if everything goes right (that is: no errors like failing disks or corruption), but the error handling is still lacking.
  • scrub cannot fix issues with RAID5/6 yet. This means that if you have any checksum problem, your file system will be in a bad state.
  • btrfs does not yet seem to recognise that a drive which was removed from an array and later plugged back in is out of date.
  • btrfs does not cope well with a drive that is present but not working: for example, an attempt to remove a faulty drive from the array (btrfs device delete) fails because it causes or requires reading from the faulty drive itself, and thus btrfs will continue to attempt to access the faulty drive forever.
  • … plus some minor but not too critical things

So much for the status of parity-based redundancy in BTRFS in general: it is evidently still in its infancy and needs to mature, as it is unusable in its current state. (As of December 2014 I have seen some patches appearing on the mailing list intended to fix some of these problems, but they are not yet in the code.)

During my research I also found some patches from Andrea Mazzoleni that would extend the kernel's ability to handle up to six parity disks per file system. Patches to add this functionality to BTRFS were also supplied (see here for the patches).
For use cases where six parity disks are not needed you can of course also pick three or four parity disks; it is up to you to decide. This would enable BTRFS to create rather large local file systems with very good resiliency against data loss.

I have tested the patches and they work just fine. So I wondered why they were not added, as they would give BTRFS a unique feature that no other operating system or file system (not even ZFS) has.

My question on the mailing list and the answer can be seen here:

LINK #1, LINK #2 and the answer here. Tests with the patches installed can be found here.

Still not satisfied with the answer, I mailed some people offline and got the following responses, which I'd like to show here:

Our plan is based on the upper distributed fs or storage, which can provide
higher reliability than local filesystems, so we think RAID1/RAID10/RAID5/RAID6
is enough for us.

Your work is very very good, it just doesn’t fit our business case.

And (from a different person):

It would be there some day in the long run I guess. But for now
bringing in the support for enterprise use cases and stability
to the btrfs has been our focus.

Of course I agree that stability matters most, given the ongoing problems and bugs in BTRFS that can easily cause the loss of a whole file system (e.g. the most recent problems with snapshots in the 3.16 and 3.17 kernels, system lockups, poor performance, space problems and so on). But if a file system also includes the functionality of a volume manager, then it needs to provide at least some basic features such as:

  • parity-based redundancy
  • hot-spare disks or n-way-mirroring (see below)
  • mirroring
  • striping
  • n-way-mirroring if there is no support for hot-spare disks (that means more than one copy of the data to protect against a failure of more than one disk in a mirrored configuration)

Currently the *only* working implementation offering resiliency against disk failures that BTRFS provides is mirroring. Ultimately this means RAID1 for a file system with exactly two disks or RAID1+0 for a file system with more than two disks. A failure of more than one disk is fatal in every case. Hot spares? Not available, and nobody is working on them. N-way mirroring? Not available, and nobody has claimed it. RAID5/6? You guessed it: not available/usable at the moment. No mirroring at all? That is not what I would use BTRFS for, as the whole idea of a checksum-based file system is to detect and repair corruption, which is impossible without redundancy.

The list of claimed, ongoing feature development for BTRFS can be found on the BTRFS wiki. As you can see, nobody is working on hot-spare support, n-way mirroring or RAID5/RAID6.

So, returning to the answer I got offline, I'd like to ask what the overall goal of BTRFS is and for whom it is developed: the so-called “(our) business case” in the cited text. To me it seems the focus is not on being a general-purpose Linux file system; instead it seems to be a file system for large enterprises where data redundancy is already handled at lower (= storage) levels. Support for large local file systems, like ZFS offers, is not available and (in my impression) not desired.

Now if you read through all of this and you want to comment like

“… then develop and post patches – they´d be highly welcomed”

then I tell you: They are obviously not. At least on BTRFS.

The patches posted were working fine and can easily be adapted for the upstream kernel. Given the immature state of the RAID5/6 support in BTRFS, this is the ideal time to add them: before a lot of work is put into the RAID5/6 code. So why hesitate?

Neil Brown, who maintains the kernel RAID code, would send the patches upstream immediately if someone (that means either MDADM or BTRFS) would use them. MDADM patches are not ready (yet?) and BTRFS doesn't seem to want them.

So yeah.. so much for BTRFS and parity-based redundancy.

For MDADM it looks a bit better. The effort needed to change the code in MDADM is non-trivial, and currently Andrea Mazzoleni, who developed the patches, does not have time to work on an MDADM patch. I cannot do it myself as I am a pretty bad software developer… If you think you can do it, I'd be happy to help as a tester or with equipment so that at least MDADM gets a much better parity implementation. Please contact me if you are willing to do this.

So which options are left for large local file systems on Linux?

You could either use ZFS on Linux, which supports up to three parity disks per VDEV (a number of disks are grouped into a VDEV to which the chosen parity level is applied) and then scale by adding more VDEVs, or you could use several smaller RAID6 arrays (I would not recommend RAID5 due to the high risk of losing the array if a disk fails and you need to rebuild) and group them together with LVM on top. Leaving the licensing discussion aside, ZFS on Linux works quite well but requires a lot of manual work to get it running. Once it runs it is stable; just make sure you use the latest ZFS code. On top of that, ZFS is far more mature than BTRFS while offering all of its features and more.
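
To make the trade-off between the two layouts a bit more concrete, here is a rough Python sketch that compares usable capacity and guaranteed fault tolerance. The disk count, disk size and group sizes are made-up illustration values, the `Layout` class is hypothetical, and the arithmetic ignores metadata and file system overhead.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    name: str
    groups: int           # number of VDEVs or RAID6 arrays
    disks_per_group: int  # disks in each group, including parity disks
    parity: int           # parity disks per group (3 for raidz3, 2 for RAID6)
    disk_tb: float        # hypothetical size of one disk in TB

    def usable_tb(self) -> float:
        # Only the data disks of each group contribute capacity.
        return self.groups * (self.disks_per_group - self.parity) * self.disk_tb

    def guaranteed_failures(self) -> int:
        # Worst case: every failed disk sits in the same group.
        return self.parity


# Hypothetical example: 24 disks of 4 TB each, split two different ways.
layouts = [
    Layout("ZFS, 2 x raidz3 VDEV", groups=2, disks_per_group=12, parity=3, disk_tb=4.0),
    Layout("3 x RAID6 + LVM", groups=3, disks_per_group=8, parity=2, disk_tb=4.0),
]

for layout in layouts:
    print(f"{layout.name}: {layout.usable_tb():.0f} TB usable, "
          f"survives any {layout.guaranteed_failures()} disk failures")
```

With these example numbers both layouts spend six disks on parity, but the triple-parity layout tolerates one more arbitrary disk failure.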

Categories: UNIX Tags:

Finally: I am now Oracle Certified Master 11g

November 28th, 2014 3 comments

Yes! I made it. I passed the exam and I am now a proud member of the OCM family:

[Image: OCM 11g certificate]

Categories: Oracle in general Tags:

Solaris Live Network Bandwidth Monitoring

August 5th, 2014 No comments

When doing performance analysis, the current network throughput is often of interest. The following nice script displays it, even without root permissions.

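#!/bin/sh
 # usage: netvolmon DEV [INTERVAL]
 DEV=$1
 IVAL=${2:-5}   # the default of 5 seconds is an assumption
 # print the current values of the 64-bit byte counters of device $1
 getrxtx() {
  kstat -p "*:*:$1:*bytes64" |
  awk '{print $2}'
 }
 rxtx=`getrxtx $DEV`
 while sleep $IVAL; do
  nrxtx=`getrxtx $DEV`
  (echo $IVAL $rxtx $nrxtx) |
  awk 'BEGIN {
  msg = "%6.2f MB/s RX %6.2f MB/s TX\n"}
  {rxd = ($4 - $2) / (1024*1024*$1);
  txd = ($5 - $3) / (1024*1024*$1);
  printf msg, rxd, txd}'
  rxtx=$nrxtx   # the new sample becomes the baseline for the next round
 done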

The script is taken from here:

It produces the following output, which is sufficient to get a quick overview of the current network traffic:

bash-3.2$ /export/home/oracle/ igb0 5
  0.59 MB/s RX  20.54 MB/s TX
  1.17 MB/s RX  40.81 MB/s TX
  1.71 MB/s RX  59.72 MB/s TX
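
The arithmetic behind each output line is simply the difference of two counter samples divided by the interval. A minimal Python sketch of that calculation follows; the function name and the counter values are made up for illustration, and reading the counters themselves (kstat on Solaris) is left out.

```python
MIB = 1024 * 1024


def throughput(prev, cur, interval_s):
    """Return (rx, tx) in MB/s from two (rx_bytes, tx_bytes) counter samples."""
    rx = (cur[0] - prev[0]) / (MIB * interval_s)
    tx = (cur[1] - prev[1]) / (MIB * interval_s)
    return rx, tx


# Hypothetical counter values sampled 5 seconds apart.
before = (1_000_000_000, 5_000_000_000)
after = (1_003_093_299, 5_107_688_591)

rx, tx = throughput(before, after, 5)
print("%6.2f MB/s RX %6.2f MB/s TX" % (rx, tx))
```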
Categories: Oracle in general Tags:

ORA-29275: partial multibyte character during RMAN resync

February 12th, 2014 No comments

Today I had a small problem registering a newly created database in an RMAN catalog. The only unusual thing was that the database was using a Unicode character set.

RMAN failed with:

[oracle@rac02a ~]$ rman target / catalog rman/rman@rman
Recovery Manager: Release - Production on Wed Feb 12 05:31:49 2014
Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
connected to target database: ORA11 (DBID=781875003)
connected to recovery catalog database
RMAN> show all;
starting full resync of recovery catalog
 RMAN-00571: ===========================================================
 RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
 RMAN-00571: ===========================================================
 RMAN-03002: failure of show command at 02/12/2014 05:31:57
 RMAN-03014: implicit resync of recovery catalog failed
 RMAN-03009: failure of full resync command on default channel at 02/12/2014 05:31:57
 ORA-29275: partial multibyte character

As you can see the database was using Unicode:

 [oracle@rac02a ~]$ sqlplus / as sysdba
SQL*Plus: Release Production on Wed Feb 12 05:32:23 2014
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
Connected to:
 Oracle Database 11g Enterprise Edition Release - 64bit Production
 With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
 Data Mining and Real Application Testing options
SQL> select * from nls_database_parameters;

So I changed the environment to match the character set of the database:

[oracle@rac02a ~]$ export NLS_LANG=GERMAN_GERMANY.AL32UTF8

And… Problem solved!

[oracle@rac02a ~]$ rman target / catalog rman/rman@rman
Recovery Manager: Release - Production on Mi Feb 12 05:32:53 2014
Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
Mit Ziel-Datenbank verbunden: ORA11 (DBID=781875003)
Verbindung mit Datenbank des Recovery-Katalogs
 RMAN> show all;
Vollständige Neusynchronisation des Recovery-Katalogs wird begonnen
 Vollständige Neusynchronisation abgeschlossen
 RMAN-Konfigurationsparameter für Datenbank mit db_unique_name ORA11 sind:
 CONFIGURE SNAPSHOT CONTROLFILE NAME TO '/nfs/backup/rman/ORA11/backup/snapcf_ORA11.f';
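
As a general illustration of what a “partial multibyte character” is (this is not a trace of what RMAN does internally, just the underlying concept), the following Python sketch cuts a UTF-8 byte string in the middle of a two-byte character. The cut position is an artificial stand-in for any place where multibyte data is handled as if it were single-byte.

```python
text = "Vollständige"            # contains "ä", which needs two bytes in UTF-8
data = text.encode("utf-8")

# Cut the byte string right after the first byte of "ä" (artificial boundary).
cut = data.index("ä".encode("utf-8")) + 1
truncated = data[:cut]

try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    # The last character is incomplete, so decoding must fail.
    print("decode failed:", exc.reason)

# In a single-byte character set one byte is always one character,
# so byte count and character count are identical.
print(len(text), "characters,", len(data), "bytes in UTF-8,",
      len(text.encode("latin-1")), "bytes in Latin-1")
```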
Categories: Oracle in general Tags:

Adding more than 512 LUNs to Oracle Enterprise Linux

January 3rd, 2014 No comments

When dealing with large databases it might be necessary to have more than 512 LUNs attached to the server. Most Linux distributions default to a maximum of 512 LUNs. This gets even worse with multipathing, where each path appears as a separate LUN.

Error messages like this indicate the limit was hit:

Dec 02 22:24:46 server kernel: scsi: host 2 channel 0 id 2 lun258 has a LUN larger than allowed by the host adapter
 Dec 02 22:24:46 server kernel: scsi: host 2 channel 0 id 2 lun259 has a LUN larger than allowed by the host adapter

So what is needed to change that limit?

First you need to raise the maximum number of LUNs allowed by the SCSI stack by adding the following line to /etc/modprobe.conf:

options scsi_mod max_luns=2048

Depending on the SCSI HBA used, the driver module itself might need some adjustments. For Emulex HBAs (lpfc is the Emulex driver, not the QLogic one) the following line is required in /etc/modprobe.conf:

options lpfc lpfc_max_luns=2048

After rebooting or removing and re-inserting the module you should be able to attach 2048 LUNs.
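
To check whether the limit is still being hit after the change, the kernel log can be scanned for the message shown above. The following Python sketch does that for a list of log lines; the regular expression is derived only from the two sample lines in this post, so other kernel versions may word the message differently.

```python
import re

# Pattern derived from the sample messages above (an assumption about the format).
PATTERN = re.compile(
    r"scsi: host (\d+) channel (\d+) id (\d+) lun(\d+) has a LUN larger than allowed"
)


def rejected_luns(lines):
    """Return (host, channel, id, lun) tuples for LUNs the kernel refused."""
    hits = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            hits.append(tuple(int(group) for group in match.groups()))
    return hits


sample = [
    "Dec 02 22:24:46 server kernel: scsi: host 2 channel 0 id 2 lun258 "
    "has a LUN larger than allowed by the host adapter",
    "Dec 02 22:24:46 server kernel: scsi: host 2 channel 0 id 2 lun259 "
    "has a LUN larger than allowed by the host adapter",
]

for host, channel, target, lun in rejected_luns(sample):
    print(f"host {host} channel {channel} id {target}: LUN {lun} rejected")
```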

By the way, the Linux device interface (including the multipath driver) has supported more than 254 LUNs since kernel 2.6:

[root@ora1 ~]# vgs 
 VG     #PV #LV #SN Attr   VSize   VFree 
 vgroot   1 388   0 wz--n- 179,88G 94,50G

As you can see, the VG has 388 LVs. More than 254 … so we should actually run out of minor numbers in /dev, right? In fact the device number was widened to 32 bits in kernel 2.6, with 12 bits for the major and 20 bits for the minor number:

[root@ora1 mapper]# ls -la /dev/mapper/vgroot-disk00*
brw-rw---- 1 root disk 253,  6 24. Dez 05:27 /dev/mapper/vgroot-disk001
brw-rw---- 1 root disk 253,  7 24. Dez 05:27 /dev/mapper/vgroot-disk002
brw-rw---- 1 root disk 253,  8 24. Dez 05:27 /dev/mapper/vgroot-disk003
brw-rw---- 1 root disk 253, 308 24. Dez 05:27 /dev/mapper/vgroot-disk306
brw-rw---- 1 root disk 253, 309 24. Dez 05:27 /dev/mapper/vgroot-disk307
brw-rw---- 1 root disk 253, 310 24. Dez 05:27 /dev/mapper/vgroot-disk308
brw-rw---- 1 root disk 253, 311 24. Dez 05:27 /dev/mapper/vgroot-disk309
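
A quick way to convince yourself that minor numbers above 255 are handled correctly is to compose and decompose a device number. This small Python sketch uses the standard `os.makedev`, `os.major` and `os.minor` functions with the values from the listing above; it only exercises the encoding and does not touch any device node.

```python
import os

# Major 253 / minor 308 as shown for vgroot-disk306 in the listing above.
dev = os.makedev(253, 308)

# Both parts survive the round trip although the minor number exceeds 255.
print("major:", os.major(dev), "minor:", os.minor(dev))
```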
Categories: Oracle in general Tags:

Copy permissions

November 15th, 2013 No comments

Sometimes, especially on RACs, it is necessary to synchronize the permissions of an ORACLE_HOME on one node with the permissions on the other node.


Today I had the problem that after a patch all permissions were garbled. In order to restore service I had to fix them. The first idea here is to copy the permissions from a working node, and here is an easy way to do it:

find /data/oragrid/product/ -printf 'chown %U.%G %p\n' > /tmp/
find /data/oragrid/product/ -type d -printf "chmod %m %p \n" > /tmp/
find /data/oragrid/product/ -type f -printf "chmod %m %p \n" > /tmp/

The generated script for the file permissions looks like this:

chmod 775 /data/oragrid/product/
chmod 755 /data/oragrid/product/
chmod 755 /data/oragrid/product/
chmod 755 /data/oragrid/product/
chmod 755 /data/oragrid/product/

and for the ownership:

chown 0.30275 /data/oragrid/product/
chown 20341.30275 /data/oragrid/product/
chown 20341.30275 /data/oragrid/product/
chown 20341.30275 /data/oragrid/product/
chown 20341.30275 /data/oragrid/product/

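Just execute the generated scripts on the broken node and you are done. Run the ownership script first and the permission scripts afterwards, because on Linux chown can clear the setuid/setgid bits that chmod then has to restore.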
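
The same idea can be sketched in Python for cases where a find with -printf is not available. This is an illustrative sketch only: the function name is made up, symbolic links are skipped, and the demo runs against a throw-away temporary directory rather than a real ORACLE_HOME. It emits all chown lines before the chmod lines, since changing ownership can clear setuid/setgid bits.

```python
import os
import shlex
import stat
import tempfile


def permission_commands(root):
    """Return chown lines followed by chmod lines for everything under root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        paths.append(dirpath)
        paths.extend(os.path.join(dirpath, name) for name in filenames)

    chowns, chmods = [], []
    for path in paths:
        st = os.lstat(path)
        if stat.S_ISLNK(st.st_mode):
            continue  # the mode of a symlink itself is not meaningful
        quoted = shlex.quote(path)
        chowns.append(f"chown {st.st_uid}:{st.st_gid} {quoted}")
        # S_IMODE keeps the permission bits including setuid/setgid/sticky.
        chmods.append(f"chmod {stat.S_IMODE(st.st_mode):o} {quoted}")
    return chowns + chmods


# Demo on a temporary directory (stand-in for the real directory tree).
with tempfile.TemporaryDirectory() as demo_root:
    demo_file = os.path.join(demo_root, "example.bin")
    with open(demo_file, "w") as handle:
        handle.write("demo")
    os.chmod(demo_file, 0o640)
    demo_commands = permission_commands(demo_root)
    demo_expected = f"chmod 640 {shlex.quote(demo_file)}"
    print("\n".join(demo_commands))
```

Redirect the output into a file on the working node and execute it as root on the node that needs fixing.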

Categories: Oracle in general Tags:

Recompile database objects (utlrp) fails with: ORA-04045: errors during recompilation/revalidation of SYS.DBMS_REGISTRY_SYS

July 16th, 2013 2 comments

I just discovered another flaw that shows up when the workaround for CVE-2012-3132 is in place. If you try to compile invalid database objects it will fail with:

SQL> @$ORACLE_HOME/rdbms/admin/utlrp
SELECT dbms_registry_sys.time_stamp('utlrp_bgn') as timestamp from dual
ERROR at line 1:
ORA-04045: errors during recompilation/revalidation of SYS.DBMS_REGISTRY_SYS
ORA-04067: not executed, package body "SYS.NAME_SECURITY" does not exist


The solution for this is:

Either disable the trigger (ALTER TRIGGER sys.NAMECHECK_BEFORE_DDL_DB_TRG DISABLE;) or disable the execution of all system triggers (ALTER SYSTEM SET "_system_trig_enabled" = FALSE;).

Categories: Oracle in general Tags:

Quick & Dirty: AWR Query for Tablespace growth

July 1st, 2013 No comments

The query:

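-- the snap_id / tablespace_name columns and the join conditions are assumptions
-- based on the sample output and the standard AWR views
select
  rtime as time,
  tsu.snap_id,
  tsb.tablespace_name,
  round(tablespace_size*tsb.block_size/1024/1024,0) SIZE_MB,
  round(tablespace_maxsize*tsb.block_size/1024/1024,0) MAXSIZE_MB,
  round(tablespace_usedsize *tsb.block_size/1024/1024,0) USEDSIZE_MB
from
  dba_hist_tbspc_space_usage tsu,
  v$tablespace ts,
  dba_hist_snapshot sn,
  dba_tablespaces tsb
where
  tsu.tablespace_id = ts.ts#
  and tsb.tablespace_name = ts.name
  and sn.snap_id= tsu.snap_id
order by
sn.snap_id desc, tablespace_name;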


Sample output:

07/01/2013 11:00:30    644    SYSAUX    64000     65536        2154
07/01/2013 11:00:30    644    SYSTEM    6144      8192         2884
07/01/2013 11:00:30    644    TOOLS     100       65536        1
07/01/2013 11:00:30    644    UNDOTBS1  65535     65535        54
07/01/2013 11:00:30    644    USERS     8         8            7
07/01/2013 10:00:28    643    SYSAUX    64000     65536        2154
07/01/2013 10:00:28    643    SYSTEM    6144      8192         2884
07/01/2013 10:00:28    643    TOOLS     100       65536        1
07/01/2013 10:00:28    643    UNDOTBS1  65535     65535        58
07/01/2013 10:00:28    643    USERS     8         8            7
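
To turn such a listing into actual growth figures, the used size per tablespace can be compared between the oldest and the newest snapshot. The following Python sketch does this for a few rows copied from the sample output; the column order (date, time, snapshot id, tablespace, size, max size, used size) is taken from that sample, and the function name is made up.

```python
SAMPLE = """\
07/01/2013 11:00:30    644    SYSAUX    64000     65536        2154
07/01/2013 11:00:30    644    UNDOTBS1  65535     65535        54
07/01/2013 10:00:28    643    SYSAUX    64000     65536        2154
07/01/2013 10:00:28    643    UNDOTBS1  65535     65535        58
"""


def used_mb_growth(report):
    """Map tablespace name to the change in used MB, oldest to newest snapshot."""
    used = {}  # tablespace name -> {snap_id: used_mb}
    for line in report.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # skip anything that is not a data row
        snap_id, name, used_mb = int(fields[2]), fields[3], int(fields[6])
        used.setdefault(name, {})[snap_id] = used_mb
    return {
        name: snaps[max(snaps)] - snaps[min(snaps)]
        for name, snaps in used.items()
    }


for name, delta in sorted(used_mb_growth(SAMPLE).items()):
    print(f"{name}: {delta:+d} MB")
```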
Categories: Oracle in general Tags:

Oracle 12c Release 1: New Features in a Nutshell

June 26th, 2013 No comments

Oracle 12c Release 1 has been released!


Yes, the new generation of Oracle's database platform has been released. You can download it from Oracle.


Now that 12c Release 1 has been released, I am freed from the NDA and allowed to share my information on the most recent Oracle Database version.

The following slides are from a presentation I gave at the 2013 ‘Frankfurter Datenbanktage’ in Germany. I put together a summary of all new features using all the available sources from Oracle.

You can download the slides here.

Categories: Oracle in general Tags:

How to solve ‘ORA-00600: internal error code, arguments: [18062]’

June 25th, 2013 No comments

A few months back I accidentally set my global database name to “null”. As a result the database crashed with “ORA-00600: internal error code, arguments: [18062]”:

SQL> update global_name set global_name=null;
SQL> commit;


Starting background process QMNC
Fri Oct 21 21:38:12 2011
QMNC started with pid=68, OS id=26764
Errors in file /data/oracle/ORADB/admin/diag/rdbms/ORADB/ORADB2/trace/ORADB2_ora_22638.trc  (incident=26069):
ORA-00600: internal error code, arguments: [18062], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /data/oracle/ORADB/admin/diag/rdbms/ORADB/ORADB2/incident/incdir_26069/ORADB2_ora_22638_i26069.trc
Fri Oct 21 21:38:14 2011
db_recovery_file_dest_size of 20000 MB is 71.54% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /data/oracle/ORADB/admin/diag/rdbms/ORADB/ORADB2/trace/ORADB2_ora_22638.trc:
ORA-00600: internal error code, arguments: [18062], [], [], [], [], [], [], [], [], [], [], []
Errors in file /data/oracle/ORADB/admin/diag/rdbms/ORADB/ORADB2/trace/ORADB2_ora_22638.trc:
ORA-00600: internal error code, arguments: [18062], [], [], [], [], [], [], [], [], [], [], []
Error 600 happened during db open, shutting down database
USER (ospid: 22638): terminating the instance due to error 600
Instance terminated by USER, pid = 22638
ORA-1092 signalled during: alter database open...
opiodr aborting process unknown ospid (22638) as a result of ORA-1092
Fri Oct 21 21:38:17 2011
ORA-1092 : opitsk aborting process


According to the Metalink articles your database is toast and you have to restore it from a backup. But what if there is no backup and you are not that familiar with BBED (the Oracle block editor)?

Here is how to solve the issue:

Read more…

Categories: Oracle in general Tags: