
ext4 file systems and the 16 TB limit – how to *solve* it

File systems do have limits. That's no surprise. ext3 had a limit of 16 TB on the file system size. If you needed more space you had to use another file system, for instance XFS or JFS, or split the capacity into multiple mount points.

ext4 was designed to allow far larger file systems than ext3. According to Wikipedia, ext4 has a maximum file system size of 1 EiB (one exbibyte, i.e. 1024 PiB or 1,048,576 TiB).

Now if you try to create one single large file system with ext4 on any Linux distribution out there (including OEL 6.1; as of 18th August 2011), you will end up with:

[root@localhost ~]# mkfs.ext4 /dev/iscsi/test
mke4fs 1.41.9 (22-Aug-2009)
mkfs.ext4: Size of device /dev/iscsi/test too big to be expressed in 32 bits using a blocksize of 4096.
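The message already names the root cause: the userspace tools store block numbers in 32 bits, so with the default 4 KiB block size the largest addressable file system is 2^32 blocks – exactly the old 16 TiB limit. A quick bit of shell arithmetic confirms it:

[root@localhost ~]# echo $(( 2**32 * 4096 / 1024**4 )) TiB
16 TiB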

This post is about how to solve the issue.

The demo system

My demo system consists of one large LUN of 18 TB, encapsulated in LVM with a logical volume of 17 TB, on an Oracle Enterprise Linux (OEL 5.5) system:

[root@localhost ~]# uname -a
Linux localhost.localdomain 2.6.18-194.el5 #1 SMP Mon Mar 29 22:10:29 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.5 (Tikanga)
[root@localhost ~]# fdisk -l /dev/sdb
Disk /dev/sdb: 19791.2 GB, 19791209299968 bytes
255 heads, 63 sectors/track, 2406144 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/sdb doesn't contain a valid partition table 

[root@localhost ~]# vgdisplay iscsi
--- Volume group ---
VG Name               iscsi
System ID
Format                lvm2
Metadata Areas        1
Metadata Sequence No  2
VG Access             read/write
VG Status             resizable
MAX LV                0
Cur LV                1
Open LV               0
Max PV                0
Cur PV                1
Act PV                1
VG Size               18.00 TB
PE Size               4.00 MB
Total PE              4718591
Alloc PE / Size       4456448 / 17.00 TB
Free  PE / Size       262143 / 1024.00 GB
VG UUID               tdi4f2-3ZYr-c1P0-NuSl-i3w2-5qQl-K75guj
[root@localhost ~]# lvdisplay iscsi
--- Logical volume ---
LV Name                /dev/iscsi/test
VG Name                iscsi
LV UUID                8q1UrT-ludC-FEkT-NExO-4Gzd-cn5H-FYJcB1
LV Write Access        read/write
LV Status              available
# open                 0
LV Size                17.00 TB
Current LE             4456448
Segments               1
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           253:2
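For completeness, the volume group and logical volume above were created with the usual LVM commands – roughly like this sketch, using the names from the listings (your device path and sizes will differ):

[root@localhost ~]# pvcreate /dev/sdb
[root@localhost ~]# vgcreate iscsi /dev/sdb
[root@localhost ~]# lvcreate -L 17T -n test iscsi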

Creating file systems larger than 16 TB with ext4:

If you try to create an ext4 file system on the 17 TB logical volume:

[root@localhost ~]# mkfs.ext4 /dev/iscsi/test
mke4fs 1.41.9 (22-Aug-2009)
mkfs.ext4: Size of device /dev/iscsi/test too big to be expressed in 32 bits using a blocksize of 4096.

OK. Maybe with ext4dev:

[root@localhost ~]# mkfs.ext4dev /dev/iscsi/test
mke4fs 1.41.9 (22-Aug-2009)
mkfs.ext4dev: Size of device /dev/iscsi/test too big to be expressed in 32 bits using a blocksize of 4096.

Nope – no success. The reason is that the e2fsprogs (or, as they are called on OEL, e4fsprogs) are not able to deal with file systems larger than ~16 TB.

To be specific: Even with the most recent e2fsprogs 1.41.14 there is no way to create file systems larger than 16 TB.
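To check which version your distribution actually ships, ask the package manager or the tool itself (the package name below is the RHEL/OEL 5 one – on most other distributions it is simply e2fsprogs):

[root@localhost ~]# rpm -q e4fsprogs
[root@localhost ~]# mkfs.ext4 -V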

But: According to this post it should work since June:

It’s taken way too long, but I’ve finally finished integrating the 64-bit patches into e2fsprogs’s mainline repository. All of the necessary patches should now be in the master branch for e2fsprogs. The big change from before is that I replaced Val’s changes for fixing up how mke2fs picked the correct fs-type profile from mke2fs.conf with something that I think works much better and leaves the code much cleaner. With this change you need to add the following to your /etc/mke2fs.conf file if you want to enable the 64-bit feature flag automatically for a big disk:

[fs_types]
	ext4 = {
		features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
		auto_64-bit_support = 1   # <---- add this line
		inode_size = 256
	}

Alternatively you can change the features line to include the feature “64bit”; this will force the use of the 64-bit fields, and double the size of the block group descriptors, even for smaller file systems that don’t require the 64-bit support. (This was one of my problems with Val’s implementation; it forced the mke2fs.conf file to always enable the 64-bit feature flag, which would cause backwards compatibility issues.) This might be a good thing to do for debugging purposes, though, so this is an option which I left open, but the better way of doing things is to use the auto_64-bit-support flag.
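Put differently: instead of the auto_64-bit_support switch you could extend the features line itself, which then forces 64-bit group descriptors even for small file systems:

features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize,64bit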

So the change must be there. A short look at the ‘WIP’ (work-in-progress) branch of e2fsprogs confirmed the integration.

So I tried to build the most recent e2fsprogs (remember: these are *development* tools – use at your OWN RISK):

[root@vm-mkmoel ~]# git clone git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
[root@vm-mkmoel ~]# cd e2fsprogs
[root@vm-mkmoel e2fsprogs]# mkdir build ; cd build/
[root@vm-mkmoel build]# ../configure
[root@vm-mkmoel build]# make
[root@vm-mkmoel build]# make install
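Before formatting anything it is worth making sure you are really running the freshly built binaries and not the distribution tools in /sbin – a quick sanity check, assuming the build layout used above (mke2fs -V only prints the version banner; which shows which binary a plain call would pick up):

[root@vm-mkmoel build]# ./misc/mke2fs -V
[root@vm-mkmoel build]# which mke2fs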

So let's try to create a file system:

[root@vm-mkmoel misc]# ./mke2fs -O 64bit,has_journal,extents,huge_file,flex_bg,\
uninit_bg,dir_nlink,extra_isize -i 4194304 /dev/iscsi/test

mke2fs 1.42-WIP (02-Jul-2011)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
4456448 inodes, 4563402752 blocks
228170137 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=6710886400
139264 block groups
32768 blocks per group, 32768 fragments per group
32 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
2560000000, 3855122432
Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 0 mounts or 0 days,
whichever comes first.  Use tune2fs -c or -i to override.

OK. Seems to have worked. Let's check it:

[root@vm-mkmoel misc]# mount /dev/iscsi/test /mnt
[root@vm-mkmoel misc]# df -h
Filesystem                          Size Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00     18G  2.6G   14G  16% /
/dev/sda1                           99M  13M  82M    14% /boot
tmpfs                               502M 0    502M   0% /dev/shm
/dev/mapper/iscsi-test              17T  229M   17T   1% /mnt
[root@vm-mkmoel misc]# mount | grep mnt
/dev/mapper/iscsi-test on /mnt type ext4 (rw)

As you can see: with the most recent development e2fsprogs it is possible to create ext4 file systems larger than 16 TB.

I even tried it with a 50 TB file system (because that's what I needed in my use case):

[root@vm-mkmoel misc]# df -h
Filesystem                          Size Used Avail Use% Mounted on
/dev/mapper/iscsi-test              50T  237M   48T   1% /mnt
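If you want to reproduce that, the steps are the same as for the 17 TB volume: extend the volume group with additional physical volumes, grow the logical volume and run the 64-bit-aware mke2fs against it. A sketch with the device names from above – the -i value (8 MiB per inode, matching the inode density mentioned in the fsck section below) is only an example, adjust it to the number of files you expect:

[root@vm-mkmoel ~]# lvextend -L 50T /dev/iscsi/test
[root@vm-mkmoel ~]# ./mke2fs -O 64bit,has_journal,extents,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize -i 8388608 /dev/iscsi/test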

Update:

Today I tested some more user space tools.

fsck

Maybe the most important tool in case the journaling fails. I copied some data to the file system (roughly 2 TB) and had 73% of my 6.5 million inodes (one inode per 8 MB) allocated. Running fsck on my demo system with 1 GB of memory yields:

[root@vm-mkmoel ~]# fsck.ext4 -f /dev/iscsi/test
e2fsck 1.42-WIP (02-Jul-2011)
Pass 1: Checking inodes, blocks, and sizes
Error allocating block bitmap (4): Memory allocation failed

fsck is rather messy with memory. Increasing the memory to 8 GB did it. While running fsck I noticed a memory consumption of up to 3.4 GB! So large file systems require a lot of memory for fscking, and even more with more inodes.
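If adding memory is not an option, e2fsck can be told to keep some of its scratch data on disk instead of in RAM via /etc/e2fsck.conf. This is much slower, but it may get an otherwise impossible check through. A minimal sketch – the directory is just an example, create it first and point it at any file system with enough free space:

# /etc/e2fsck.conf
[scratch_files]
	directory = /var/cache/e2fsck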

resize2fs

After fscking my file system I tried to resize it:

[root@localhost sbin]# lvresize -l +7199 /dev/iscsi/test
  Extending logical volume test to 50.00 TB
  Logical volume test successfully resized
[root@localhost sbin]# resize2fs /dev/iscsi/test
resize2fs 1.42-WIP (02-Jul-2011)
resize2fs: New size too large to be expressed in 32 bits

As you can see, resizing the file system is not yet supported/implemented. So it would be wise to create the file system with its final size from the start, since growing is NOT possible!

tune2fs

tune2fs seems to work – at least it dumps the superblock contents:

[root@localhost sbin]# tune2fs -l /dev/iscsi/test
tune2fs 1.42-WIP (02-Jul-2011)
Filesystem volume name:   <none>
Last mounted on:          /mnt/mnt
Filesystem UUID:          a754e947-8b89-415d-909d-000e6c95c44a
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              6550000
Block count:              13414400000
Reserved block count:     670720000
Free blocks:              13394134177
Free inodes:              1484526
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1024
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16
Inode blocks per group:   1
Flex block group size:    16
Filesystem created:       Wed Oct 19 17:09:06 2011
Last mount time:          Wed Oct 19 18:45:47 2011
Last write time:          Wed Oct 19 18:45:47 2011
Mount count:              1
Maximum mount count:      20
Last checked:             Wed Oct 19 18:35:36 2011
Check interval:           0 (<none>)
Lifetime writes:          2511 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      ea117174-a04a-412e-a067-7972804f83d7
Journal backup:           inode blocks

Setting properties works as well:

[root@localhost sbin]# tune2fs -L test /dev/iscsi/test
tune2fs 1.42-WIP (02-Jul-2011)
[root@localhost sbin]# tune2fs -l /dev/iscsi/test | head -10
tune2fs 1.42-WIP (02-Jul-2011)
Filesystem volume name:   test
Last mounted on:          /mnt/mnt
[...]

e4defrag

e4defrag is a new tool to defragment the ext4 file system. According to the man page:

e4defrag reduces fragmentation of extent based file. The file targeted by e4defrag is created on ext4 filesystem made with “-O extent” option (see mke2fs(8)). The targeted file gets more contiguous blocks and improves the file access speed.

I am not yet sure how this affects file systems used for Oracle datafiles. All I can say is that e4defrag seems to work with >16 TB file systems:

 

[root@localhost sbin]# e4defrag /mnt/
ext4 defragmentation for directory(/mnt/)
[....]
        Success:                        [ 4772040/5065465 ]
        Failure:                        [ 293425/5065465 ]

The failures are from directories which cannot be defragmented.
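If you only want to know how fragmented the file system is, without actually moving any blocks, e4defrag can also be run in report-only mode first:

[root@localhost sbin]# e4defrag -c /mnt/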

 

Conclusion

With the most recent e2fsprogs (1.42-WIP) it is possible to create ext4 file systems larger than 16 TB.

If you do so, remember the following:

  • the tools are still in development – use at your own risk!
  • tune the values for autocheck (after x mounts / after y days) – see the tune2fs example after this list
  • adjust the “-i” switch which defines the bytes/inode ratio; with -i 4194304 as in the example above one inode is created for every 4 MiB
  • the more inodes you create, the longer fsck takes and the more memory it needs
  • resizing the file system (growing / shrinking) is NOT possible at the moment
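For the autocheck values mentioned above: to have the file system checked after every 20 mounts or at least every 180 days (the numbers are only an illustration; pick whatever fits your maintenance windows), run:

[root@localhost ~]# tune2fs -c 20 -i 180d /dev/iscsi/test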
  1. Kohai
    September 1st, 2011 at 19:25 | #1

    Thanks for the info, we have 1.42-WIP too and just realized we can’t use resize2fs. It was good to see your write up to confirm we’re not crazy — it doesn’t work yet. (Our system is a production system, so we get to wait until they fix it.)

  2. Brian Candler
    November 1st, 2011 at 13:46 | #2

    Excellent article, thank you. Just one minor note: 1 exabyte is 1024 PB, which is 1048576 TB.

  3. Roman
    December 26th, 2011 at 16:25 | #3

    Doesn’t seem like the latest git pull tune2fs will let you set -O 64bit on an existing FS.

    “Setting filesystem feature ’64bit’ not supported.”

    There seems to be a patch in there to allow resize2fs to go beyond 16TB, but it won’t do it @ 32 bit, and yet you can’t modify it to 64bit with tune.

  4. March 3rd, 2012 at 04:55 | #4

    yes.
    Just this line “auto_64-bit_support = 1” solved the 16 TB limit.
    thanks!

  5. Ronny Egner
    March 9th, 2012 at 10:52 | #5

    Yes, but only with e2fsprogs >= 1.42.

  6. E. Florac
    April 10th, 2012 at 15:53 | #6

    Or you could use XFS and call it a day :) I’ve been using XFS volumes in the 40 to 150 TB range for ages without any tinkering…

  7. Rusty
    April 22nd, 2012 at 01:35 | #7

    Do you know if there have been any updates to this procedure? e.g. does resize2fs work with some combination of kernel/e2fsprogs versions? I have an ext4 partition that I’d like to extend beyond 16TB, is this possible today? The release notes say online resize >16tb came with 1.42 and requires kernel 3.2, but I have had no success. Additionally, tune2fs -O 64bit does not work (even with “auto_64-bit_support = 1″), is there some workaround for this? Thanks for any help.

  8. Ronny Egner
    April 22nd, 2012 at 19:55 | #8

    True. Sadly XFS does not support shrinking the file system – a feature I am sorely missing. However, using XFS for Oracle databases is something I would not recommend, since XFS is supported but not tested that much. If you need more than 16 TB you'd better go with ASM anyway…

  9. Ronny Egner
    April 22nd, 2012 at 19:56 | #9

    I don't know, to be honest. But I can test this…

  10. John
    September 2nd, 2012 at 17:55 | #10

    Thanks. Your article was a big help. It’s a bummer that resize2fs does not have 64bit support yet, though. Any idea when that may come? I wonder how much effort it would be to tweak the code…

  11. Ronny Egner
    September 27th, 2012 at 11:49 | #11

    No idea. I asked the developer but did not receive an answer…

  12. Kashif
    October 3rd, 2012 at 19:46 | #12

    Hi Ronny, just curious how stable you’ve found this solution to be thus far & how heavily loaded? Looks like I’m considering doing something similar but with +/- half the size/24TB.

  13. Ronny Egner
    October 16th, 2012 at 09:54 | #13

    Hi,

    I've not encountered any issues so far. Just make sure the file system is correctly set up (including reserved blocks). Until now tune2fs does not support changing it.

  14. December 4th, 2012 at 15:16 | #14

    From what I learned so far, for resizing of ext4 with 64Bit inodes you need a 3.7 kernel:
    http://www.h-online.com/open/features/Kernel-Log-Coming-in-3-7-Part-1-Filesystems-storage-1750000.html

    Also you should be careful… here is some discussion about a guy who broke his fs:
    http://www.spinics.net/lists/linux-ext4/msg35043.html

    So there are some pitfalls… :-(

    I don't like XFS because it can't be shrunk, right? Maybe I'll go for btrfs… It seems that I need to create a new partition anyway, so I can even try a different file system, and I very much like the b-tree stuff in btrfs.
    What do you guys think? Pretty unsure right now…

  15. Joe B
    February 17th, 2013 at 01:59 | #15

    Thanks for the great posting, just tried it and it works

    [root@hscaler-cn001 misc]# df -h
    Filesystem Size Used Avail Use% Mounted on
    /dev/sda2 916G 57G 813G 7% /
    none 16G 4.0K 16G 1% /dev/shm
    master:/cm/shared 228G 39G 178G 18% /cm/shared
    master:/home 228G 39G 178G 18% /home
    /dev/mapper/cn018_1 22T 109M 21T 1% /data/0

    I have my 22TB FS using ext4
    [root@hscaler-cn001 misc]# mount
    /dev/sda2 on / type ext4 (rw,noatime,nodiratime)
    none on /proc type proc (rw,nosuid)
    none on /sys type sysfs (rw)
    none on /dev/pts type devpts (rw,gid=5,mode=620)
    none on /dev/shm type tmpfs (rw)
    none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
    master:/cm/shared on /cm/shared type nfs (rw,rsize=32768,wsize=32768,hard,intr,nfsvers=3,addr=10.141.255.254)
    master:/home on /home type nfs (rw,rsize=32768,wsize=32768,hard,intr,nfsvers=3,addr=10.141.255.254)
    /dev/mapper/cn018_1 on /data/0 type ext4 (rw,noatime,nodiratime)

  16. March 11th, 2013 at 13:41 | #16

    Hey, Ronny.

    Just running into the same issue here with a new CentOS 6.3 installation for use with GlusterFS. We compiled an e2fsprogs 1.42.7 from git (installed to /usr/local/*) but our problem is that fsck searches /sbin statically on boot, before it searches the $PATH.

    For now we’ve left our file system as ‘0’ in fstab (ie, don’t fsck), but that’s obviously a terrible solution…

    What did you do to get around this?

  17. Gluster User
    May 17th, 2013 at 20:55 | #17

    Hi Alan,

    Gluster install doco says use XFS. So there should be no problem with ext4. Your boot disk and apps etc can all be standard ext4. However your gluster mount points are xfs. Works fine.

    Gluster user

  19. unifex
    August 21st, 2013 at 04:07 | #19

    This is very important. the command:

    “[root@vm-mkmoel misc]# ./mke2fs -O 64bit,has_journal,extents,huge_file,flex_bg, \
    uninit_bg,dir_nlink,extra_isize -i 4194304 /dev/iscsi/test”

    listed above to format the new array into a single large file system includes a specific option “-i 4194304” to set the number of inodes. The number of inodes it creates is literally almost one THOUSANDTH of the number that will be created if you leave that option out. With that few inodes my array ran out of the ability to write after only being filled with 2.2 TB of data.
    I discovered this when I created an mdadm RAID6 with sixteen 2 TB drives and tried to copy an rsnapshot backup from an older, smaller RAID to it. I got 2.2 TB into the transfer and rsnapshot failed with the error “No space left on device (28)”.

    A quick look and a ‘df -i’ showed that I had run out of inodes even though there was still plenty of space on the partition.

    slackware 14 on an ibm something or other with ram and hard drives and stuff.

  20. unifex
    August 21st, 2013 at 17:29 | #20

    I'm sorry, that last post of mine was a bit premature.
    I had been comparing the output of df -i to another RAID of mine which I had forgotten was formatted with XFS, not ext, so of course the output was different.

    Running the mkfs command without specifying -i as I suggested in my last post actually lowered the number of inodes and made the situation worse. I then did some experimenting and re-ran it specifying -i 1024,
    and THAT yielded the expected result of …four BILLION inodes :) which is both a thousand times more than I had before, and which should be enough.

    root@argon2014:~# df -i
    Filesystem Inodes IUsed IFree IUse% Mounted on
    /dev/sda1 121217024 322630 120894394 1% /
    tmpfs 752693 1 752692 1% /dev/shm
    /dev/md127p1 4292801024 11 4292801013 1% /media/md16p1

    slack 14 on an ibm something or other with ram and hard drives and stuff.
