Tuning Linux for Oracle
This is a short post on how to tune Linux for running Oracle.
It won't cover the background in depth, but I am trying to give some links with further information.
I am planning to improve it over time.
Currently this article covers:
- Partition Alignment
- Choosing a file system
- Optimizing ext3
- Enable Huge Pages
- Using Async and Direct IO
- Tune Swapping Priority
Align Partitions
What is it? When creating partitions you have to ensure your file system blocks match exactly with your storage's block layout. There is some further information here (esp. for running VMware with any operating system) or in this great article here.
On Linux, checking for alignment is quite easy using fdisk. Instead of using "fdisk -l" for querying all partitions, use "fdisk -lu". This shows each partition's start SECTOR as well (remember every sector equals 512 bytes, or 4 KB on some recent disks):
# fdisk -lu

Disk /dev/sde: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders, total 4294967296 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1               63  4294961684  2147480811   83  Linux

Disk /dev/sdf: 53.6 GB, 53687091200 bytes
64 heads, 32 sectors/track, 51200 cylinders, total 104857600 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdf1              128   104857599    52428736   83  Linux
As you can see, the first partition on disk /dev/sde starts at sector 63, which is not aligned at all. Partition /dev/sdf1 is aligned because it starts at sector 128, which is fine for all stripe configurations up to a 64 KB stripe width.
Generally speaking, I strongly recommend aligning your partitions. If you don't have stripe widths larger than 64 KB, starting at sector 128 is fine.
When using ASM you need to align at 1 MB, i.e. 2048 sectors.
The following shows how to implement it with fdisk AFTER creating a partition and BEFORE creating a file system:
[root@moloch oradata]# fdisk /dev/sde

Command (m for help): p

Disk /dev/sde: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1               1      267349  2147480811   83  Linux

Command (m for help): x

Expert command (m for help): b
Partition number (1-4): 1
New beginning of data (63-4294961684, default 63): 128
# fdisk -lu

Disk /dev/sde: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders, total 4294967296 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1              128  4294961684  2147480811   83  Linux
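The alignment rule can be sketched as a small shell check (the function name `is_aligned` is my own, not a standard tool): a start sector is aligned for a given boundary if it is a multiple of that boundary expressed in 512-byte sectors.

```shell
#!/bin/sh
# Check whether a partition's start sector is aligned to a given boundary.
# $1 = start sector (512-byte sectors), $2 = required alignment in KB.
is_aligned() {
    start=$1
    align_kb=$2
    # 1 KB = 2 sectors of 512 bytes, so the boundary in sectors is align_kb * 2
    if [ $(( start % (align_kb * 2) )) -eq 0 ]; then
        echo "aligned"
    else
        echo "misaligned"
    fi
}

is_aligned 63 64      # old fdisk default start: misaligned
is_aligned 128 64     # fine for stripe widths up to 64 KB: aligned
is_aligned 128 1024   # but not at the 1 MB boundary ASM wants: misaligned
is_aligned 2048 1024  # 1 MB boundary (2048 sectors): aligned
```

This matches the numbers above: sector 63 is never aligned, sector 128 covers stripes up to 64 KB, and sector 2048 satisfies the 1 MB requirement for ASM.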
Choose a file system
There are tremendous discussions about which file system is the "best" on Linux. For running Oracle on Linux there is only one: ext3. This matches Oracle's recommendations (and my opinion as well).
Another option is – of course – ASM.
Creating the ext3 file system
When creating the file system for storing your database files you might want to change the ratio of created inodes per bytes. This is especially useful for shortening the periodic full file system check on ext3 (I outlined it in this post).
You can safely create an ext3 file system with only one inode per 1 MB of space with:
mkfs.ext3 -T largefile <device>
Or even one inode per 4 MB with:
mkfs.ext3 -T largefile4 <device>
Remember you will need one inode per created file or directory. So I recommend these options only on file systems intended for use by data files (this includes redo logs and even archive logs).
Creating a file system with a smaller number of inodes will shorten your fsck times extremely. I did some tests with an 8 TB ext3 file system created normally, with the "-T largefile" and with the "-T largefile4" options. The file system created with defaults has one inode per 4 KB. fsck'ing an 8 TB file system filled to 50% with data files (all of 32 GB size) takes approx. 7 hours! Checking the same file system created with "-T largefile" takes only 10 minutes. With "-T largefile4" it needs approx. 5 minutes.
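The difference is easy to see with some rough arithmetic (the byte-per-inode ratios are the ones stated above; actual mkfs results vary slightly):

```shell
#!/bin/sh
# Estimate inode counts for an 8 TB file system at the three ratios
# discussed above: 1 inode per 4 KB (default), 1 MB (-T largefile),
# and 4 MB (-T largefile4).
fs_bytes=$(( 8 * 1024 * 1024 * 1024 * 1024 ))   # 8 TB

default=$((    fs_bytes / 4096 ))      # mkfs.ext3 default
largefile=$((  fs_bytes / 1048576 ))   # -T largefile
largefile4=$(( fs_bytes / 4194304 ))   # -T largefile4

echo "default:    $default inodes"
echo "largefile:  $largefile inodes"
echo "largefile4: $largefile4 inodes"
```

That is roughly 2.1 billion inodes by default versus 8.4 million with "-T largefile" and 2.1 million with "-T largefile4" – a factor of 256 and 1024 fewer inodes for fsck to walk, which is where the shorter check times come from.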
Disable atime-Updates on ext3
When accessing a file or directory, the ext3 file system updates the file's or directory's last-accessed timestamp. It does so for every read of every file. This has an unnecessary performance impact. To avoid this impact you can turn off these updates by adding
noatime,nodiratime
to your /etc/fstab or remounting the file system with:
mount -o remount,noatime,nodiratime,rw <path or device>
According to Oracle's recommendations you can safely turn this off in non-RAC environments. In RAC environments you need to check first.
Enable Huge Pages
To use your memory optimally I recommend enabling and using Huge Pages as outlined in this post.
You can enable Huge Pages by adding the following to your /etc/sysctl.conf:
vm.nr_hugepages = <number>
The <number> is the number of Huge Pages to allocate. Note that each page is 2 MB in size, so for a 2 GB SGA you need approx. 1000 Huge Pages (2048 MB / 2 MB = 1024).
More details can be found here.
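The calculation above can be sketched in a few lines of shell (the 2 MB page size is an assumption that holds on most x86_64 systems; check Hugepagesize in /proc/meminfo for yours):

```shell
#!/bin/sh
# Derive vm.nr_hugepages from the SGA size, assuming 2 MB Huge Pages.
sga_mb=2048        # a 2 GB SGA, as in the example above
hugepage_mb=2      # typical Hugepagesize on x86_64

nr_hugepages=$(( sga_mb / hugepage_mb ))
echo "vm.nr_hugepages = $nr_hugepages"

# After a reboot or "sysctl -p", verify the allocation with:
#   grep -i huge /proc/meminfo
```

In practice you may want to round up slightly so the whole SGA fits even when it does not divide evenly into 2 MB pages.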
Enable DirectIO and Asynchronous IO
By setting the parameter "filesystemio_options" in your init.ora you can enable either DirectIO, Asynchronous IO, or both.
To enable both (which I recommend on Linux) set:
filesystemio_options=setall
in your init.ora (or spfile).
Note that not all file systems support DirectIO or even Asynchronous IO. ext3 supports both.
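Before (and after) enabling Asynchronous IO it is worth glancing at the kernel's AIO limits; this is a minimal check, and the suggested limit value below is only an example, not an Oracle-mandated number:

```shell
#!/bin/sh
# fs.aio-nr is the number of AIO requests currently allocated system-wide;
# fs.aio-max-nr is the ceiling. If the database hits the ceiling it
# typically reports ORA-27090 on Linux.
cat /proc/sys/fs/aio-max-nr
cat /proc/sys/fs/aio-nr

# Raise the ceiling if needed (root required; size the value to your load):
#   sysctl -w fs.aio-max-nr=1048576
```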
Tune Swapping Priority
Linux swaps out unused pages even if there is plenty of memory free. It does so by checking whether a memory page (which is 4 KB in size) was accessed recently. If a page has not been accessed for some time, it gets swapped out to disk. The freed memory is used for other purposes – most often for the file system cache.
You can adjust how readily memory pages get swapped out via the "swappiness" value. The valid range is 0…100: "0" means no swapping unless necessary; "100" means any page that is not very frequently accessed gets swapped out; the default is 60.
I recommend setting this value quite low on a database server, e.g. 5 or even 1, by adding the following to your "/etc/sysctl.conf":
vm.swappiness=5
Note that Huge Pages are not swappable and thus always remain in memory.
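A quick way to inspect the current value and stage the change (the sysctl commands are shown as comments because they need root):

```shell
#!/bin/sh
# Show the current swappiness; the default on most distributions is 60.
cat /proc/sys/vm/swappiness

# Apply the new value immediately (root required):
#   sysctl -w vm.swappiness=5
# And persist it across reboots:
#   echo "vm.swappiness=5" >> /etc/sysctl.conf && sysctl -p
```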
noatime is a superset of nodiratime (in other words, you only need to use noatime)
Hi Ronny,
thanks for this very valuable post. We ran tests with ext4, ext3 and xfs on a 16 TB single LUN from a HP 3PAR V400.
Your findings still hold true for ext4; however, the gap between 4k (default) and largefile/largefile4 was not as big as with ext3.
To knock down the fs, we used dd ("dd if=/dev/urandom of=/dev/mapper/fsck_thick bs=4k count=512"). Of course you could also start a few processes (I recommend mixing buffered and direct I/O ones), send them to the background and reboot the server using the sysrq trigger, or, when running in a VM like Oracle VM, do a "xm destroy".
To check the potential impact on random I/O performance (requested by a customer), we ran some tests with fio. These showed that there was no performance improvement or degradation due to the change to largefile/largefile4.
Regards
Efstathios
True. But the time needed to check the file system is much shorter with largefile/largefile4…