Besides the battle of the operating systems (Windows vs. UNIX), there is one more debate once you have decided to run Oracle on Linux: the choice of the file system and the “correct” partition layout.
This article sums up which file systems are certified for running Oracle on Linux (and on other platforms as well) and discusses the file systems available on Linux in particular.
My recommendations regarding an optimal partition layout can be found in an older post here.
File system basics
What is a file system?
Before writing a definition myself, let's take a look at how Wikipedia defines a file system:
[…] a method for storing and organizing computer files and the data they contain to make it easy to find and access them. File systems may use a data storage device such as a hard disk or CD-ROM and involve maintaining the physical location of the files, they might provide access to data on a file server by acting as clients for a network protocol (e.g., NFS, SMB, or 9P clients), or they may be virtual and exist only as an access method for virtual data (e.g., procfs).
More formally, a file system is a special-purpose database for the storage, organization, manipulation, and retrieval of data.
For the DBA, a file system is the place where all database-related files (data files, redo logs, archive logs, (s)pfile, backups) are stored, unless ASM or raw disks are used :-)
File system features
For running databases the following features are most important:
- Robust/Reliable: In my opinion THE most important feature. Losing data due to bugs in the file system is extremely annoying and unnecessary.
- Full data journaling/metadata-only journaling: Journaling avoids a complete file system check after a crash, which might take several hours on large file systems. For Oracle databases, metadata-only journaling (i.e. file creations and deletions are journaled, but not the changed data) is preferable to full data journaling because the database already journals its own data.
- Resizing: The file system should generally be resizable (grow and shrink).
- Online Resizing: A superior file system should be able to perform resize operations online to minimize downtime.
- Widely used: A file system should be used by as many users as possible around the world so bugs can be found faster.
Features like compression or de-duplication are not needed for file systems used by Oracle databases. The needed features already exist in the database.
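To make the difference between full data journaling and metadata-only journaling more concrete, here is a minimal Python sketch that classifies ext3 mounts by their `data=` mount option. The function name `journaling_modes`, the sample mount table and the device names are made up for illustration. The sketch also assumes that a missing `data=` option means `data=ordered`, which depends on the kernel and distribution in use.

```python
# Sketch: classify the journaling mode of ext3 mounts from /proc/mounts-style text.
# Assumption: if no "data=" option is listed, the mount uses data=ordered.

JOURNAL_MODES = {
    "journal": "full data journaling",
    "ordered": "metadata-only journaling (data is flushed before the metadata commit)",
    "writeback": "metadata-only journaling (no ordering between data and metadata)",
}

def journaling_modes(mounts_text):
    """Return {mount_point: description} for every ext3 entry in mounts_text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] != "ext3":
            continue
        mount_point, options = fields[1], fields[3].split(",")
        mode = "ordered"  # assumed default, see note above
        for opt in options:
            if opt.startswith("data="):
                mode = opt.split("=", 1)[1]
        result[mount_point] = JOURNAL_MODES.get(mode, "unknown mode: " + mode)
    return result

# Hypothetical mount table; on a real system read the text from /proc/mounts.
SAMPLE = """\
/dev/sda1 / ext3 rw,data=ordered 0 0
/dev/sdb1 /u01/oradata ext3 rw,noatime,data=writeback 0 0
/dev/sdc1 /backup ext3 rw,data=journal 0 0
proc /proc proc rw 0 0
"""

if __name__ == "__main__":
    for mount_point, description in journaling_modes(SAMPLE).items():
        print(mount_point, "->", description)
```

On a live system the same function could be fed with `open("/proc/mounts").read()` instead of the sample text.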
Whether to use a volume manager depends on the situation; there are pros and cons. I have already outlined them in a separate posting. I generally tend to use volume managers for the operating system and the database binaries but not for partitions containing data files, because they add an extra layer of software. If I need flexibility or host-based mirroring, I use ASM.
Different Platform – Different File System
The available file systems differ depending on the platform you run your Oracle database on. ASM and raw devices are available on all platforms. Users who don’t want to place files in ASM or on raw disks must use some kind of file system. The following is an incomplete list of available file systems for different platforms:
- All platforms
- Raw Devices
- (OCFS2) ??
- VxFS / VxVM for extra money
- JFS (aka. VxFS)
- Online JFS
- VxFS / VxVM for extra money
File system details and limits
There are many, many file systems available for different platforms. Wikipedia has a nice page showing most of the available file systems together with their features, limits and operating system support.
I summarized some limits regarding maximum file size, maximum volume size and operating system support in the following table, with a focus on Oracle-certified file systems.
[table id=2 /]
- The values stated in the table are in GiB and TiB. These are sizes based on binary multiples of the byte. For instance, one GiB equals 2^30 = 1,073,741,824 bytes = 1024 MiB, whereas a GB is defined as 10^9 = 1,000,000,000 bytes.
- Note that although the maximum allowed file size may be several TiB, an Oracle database with smallfile tablespaces supports only about 4.2 million *blocks* per smallfile data file. For a block size of 8 KiB this equals 32 GiB per data file. A 16 KiB block size doubles the maximum data file size to 64 GiB. If bigfile tablespaces are used, a tablespace with 8 KiB block size can contain one bigfile data file with a total size of up to 32 TiB. Using a 32 KiB block size raises this limit to 128 TiB.
- The table contains the matrix of generally supported file systems and Oracle versions. It does *not* contain a list of supported features (e.g. async I/O, direct I/O, …) for each combination of file system, distribution and Oracle database version. For this information refer to the Metalink notes named below.
- I tried to find out whether an online/offline resize/shrink of each file system is possible and noted it in the table. There might be some errors – please feel free to correct me.
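The data file limits mentioned in the notes above can be re-derived with a few lines of Python. This is only a sketch of the arithmetic: the constants `SMALLFILE_MAX_BLOCKS` and `BIGFILE_MAX_BLOCKS` are assumptions, rounded to powers of two (2^22 for the "4.2 million blocks" and 2^32 as implied by 32 TiB at an 8 KiB block size), and the function name is made up for illustration.

```python
# Sketch: maximum Oracle data file size as (max blocks per file) x (block size).
KIB = 2**10
GIB = 2**30   # 1,073,741,824 bytes, as opposed to 1 GB = 10**9 bytes
TIB = 2**40

# Assumed block-count limits, rounded to powers of two for readability.
SMALLFILE_MAX_BLOCKS = 2**22   # roughly 4.2 million blocks
BIGFILE_MAX_BLOCKS = 2**32     # roughly 4.3 billion blocks

def max_datafile_bytes(block_size_kib, bigfile=False):
    """Return the approximate maximum data file size in bytes."""
    blocks = BIGFILE_MAX_BLOCKS if bigfile else SMALLFILE_MAX_BLOCKS
    return blocks * block_size_kib * KIB

if __name__ == "__main__":
    for block_size in (8, 16, 32):
        small = max_datafile_bytes(block_size) / GIB
        big = max_datafile_bytes(block_size, bigfile=True) / TIB
        print(f"{block_size:>2} KiB blocks: smallfile {small:.0f} GiB, bigfile {big:.0f} TiB")
```

In practice this means that the block-count limit of a smallfile data file is reached long before the maximum file size of any of the file systems discussed here.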
There are some Metalink Notes available which deal with the file system support matrix for different Linux Distributions:
- Supported File Systems on Red Hat / Oracle Enterprise Linux: Metalink Article 279069.1
- Supported File Systems on SuSE: Metalink Article 414673.1
As you can see from the list above, Linux offers a large number of file systems, whereas all other platforms offer only a few.
The choice of the “right” file system
Choosing the file system for operating systems like Windows, HP-UX or AIX is quite straightforward: On Windows, NTFS and FAT are available for single databases. FAT offers no journaling and is extremely limited in file size, so NTFS is the only valid choice left. On “traditional” UNIXes like AIX and HP-UX there is often a default file system which offers at least journaling and (offline) resizing and is therefore chosen. If you want to spend some extra money, VxFS can be obtained for many platforms (e.g. Solaris, Windows); it offers some nice advanced features like online resizing.
But choosing the file system on Linux is quite difficult: There are many certified file systems for running Oracle on Linux, for instance ext2, ext3, ReiserFS (SuSE only), XFS (SuSE only), OCFS, OCFS2 or GFS. Ruling out file systems without journaling eliminates ext2 but still leaves us with a lot of file systems, depending on the distribution used:
- Red Hat / Oracle Enterprise Linux:
- SuSE Linux Enterprise Server
- ReiserFS v3.5 / v3.6
Judging the different file systems from the author's point of view:
- ReiserFS v3.x: The author used ReiserFS on SuSE for many years without problems. It is a stable file system, just like ext3, and never led to data loss caused by software errors. Unfortunately ReiserFS is only certified with Oracle on SuSE SLES. In addition, the future development of ReiserFS is uncertain since the developer was sent to jail. Even Novell has announced that it will move away from ReiserFS to ext3.
- XFS: Also used by the author, but not for databases due to some bugs in XFS which might lead to database corruption. Like ReiserFS, XFS is only certified with Oracle on SuSE SLES. But XFS is fine for non-Oracle systems.
- ext3: Successor of ext2 and “the” default file system on Linux – especially on Red Hat and Oracle Enterprise Linux (which is basically a Red Hat clone). Ext3 offers features like metadata journaling, growing (online from Red Hat 5 and OEL 5 onward, offline in older releases) and offline shrinking. It is very stable but performs a full, old-fashioned file system check at regular intervals. Read more about this here.
- OCFS: Cluster file system for storing Oracle data files in clustered environments (“RAC”). It is quite old and had problems. It should not be used anymore and lacks support for new Oracle versions.
- OCFS2: Successor of OCFS. I can't say much about OCFS2 here because I have rarely used it (actually in only one project). The OCFS2 kernel modules should be updated on a regular basis to catch up with the latest bug fixes. If there are problems with RAC and you are using OCFS2, Oracle will most probably encourage you to update your OCFS2 modules. Happy patching :-)
Looking at the facts collected above, from my point of view there is only ONE valid file system choice for running Oracle on any Linux today: ext3. It is well maintained, offers journaling and resizing capabilities, is the most widely used file system on Linux and is a common and stable choice for ANY distribution – Red Hat, OEL, SuSE, Asianux and so on.
The other file systems fail for several reasons: ReiserFS v3 lacks development and its future is not that clear. ReiserFS v4 is still not integrated into recent kernels and not certified for Oracle yet, and I doubt it ever will be. ext2 lacks journaling, which makes it unacceptable for today's large disks, and it imposes a limit of 2 TB per file system. XFS can corrupt your redo logs, and OCFS/OCFS2 are mainly for clustered databases.
Currently there are two new file systems being developed: btrfs and ext4.
btrfs is a project initiated by Oracle which aims to implement a new and feature-rich file system. According to the official project page, btrfs will offer:
- Extent based file storage (2^64 max file size)
- Very fast offline filesystem check
- Efficient incremental backup and FS mirroring
- Space efficient packing of small files
- Space efficient indexed directories
- Dynamic inode allocation
- Writable snapshots
- Subvolumes (separate internal filesystem roots)
- Object level mirroring and striping
- Checksums on data and metadata (multiple algorithms available)
- Strong integration with device mapper for multiple device support
- Online filesystem check
In Andrew Morton, btrfs has a powerful supporter. In one of his presentations he said:
„I am hoping that btrfs will save us. But as far as I know it is not getting as much external development support as it warrants – Merging btrfs into mainline might help here“
btrfs has been part of the official kernel since January 2009. However, it should not be used in production environments yet. But it is good for playing with :-)
In contrast to btrfs, ext4 is already marked as “stable” and was officially released in December 2008 as the successor to the widely used ext3 file system. It has been part of the kernel since version 2.6.19, and several distributions (e.g. Ubuntu, Gentoo) are starting to use it as their default file system. According to Wikipedia, ext4 offers the following features:
- Support for volume sizes up to 1 EiB and file sizes up to 16 TiB
- Extent based file storage
- Backward compatibility to ext2 and ext3
- Persistent pre-allocation
- Delayed allocation
- Break 32,000 subdirectory limit – it is now 64,000
- Journal checksumming
- Online defragmentation
- Faster file system checking
- Multiblock allocator
- Improved timestamps (now measured in nanoseconds)
As of today, neither of these file systems is certified for running Oracle. So you should not place your data files on them if you intend to get any support.
I expect support for ext4 in the near future. btrfs will most probably follow soon after the developers have finished implementing most of the features. From my point of view btrfs looks more promising than ext4 because it adds some volume manager functionality (e.g. writable snapshots, subvolumes), like ZFS already does on Solaris.
Furthermore, I expect btrfs and ASM to become the most widely used technologies for storing Oracle data files. But: if Oracle completes its acquisition of Sun, ZFS might be the third competitor.