Tuning Linux for Oracle

This is a short post on how to tune Linux for running Oracle.

It does not cover the background in depth, but I link to further information where I can.

I plan to improve it over time.

Currently this article covers:

  • Partition Alignment
  • Choosing a file system
  • Optimizing ext3
  • Enable Huge Pages
  • Using Async and Direct IO
  • Tune Swapping Priority

Align Partitions

What is it? When creating partitions you have to ensure that your file system blocks line up exactly with the block (stripe) layout of your storage. Further information is available here (especially for running VMware with any operating system) and in this great article here.

On Linux, checking alignment is easy with fdisk. Instead of “fdisk -l”, use “fdisk -lu” to list the partitions. This also shows each partition's start SECTOR (a sector is 512 bytes on most disks; some recent disks use 4 KB sectors):

# fdisk -lu

Disk /dev/sde: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders, total 4294967296 sectors
Units = sectors of 1 * 512 = 512 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sde1              63  4294961684  2147480811   83  Linux

Disk /dev/sdf: 53.6 GB, 53687091200 bytes
64 heads, 32 sectors/track, 51200 cylinders, total 104857600 sectors
Units = sectors of 1 * 512 = 512 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sdf1             128   104857599    52428736   83  Linux

As you can see, the first partition on disk /dev/sde starts at sector 63 (63 × 512 bytes = 31.5 KB), which is not aligned to any common stripe size. Partition /dev/sdf1 is aligned because it starts at sector 128 (64 KB), which is fine for all stripe configurations up to a 64 KB stripe width.

Generally speaking, I strongly recommend aligning your partitions. If you don't have stripe widths larger than 64 KB, starting at sector 128 is fine.

When using ASM you need to align at 1 MB, i.e. 2048 sectors.
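To make the arithmetic concrete, here is a minimal sketch that reads “fdisk -lu” output on standard input and checks each partition's start sector against a chosen alignment. The function name check_alignment and its argument are my own for illustration; it assumes 512-byte sectors and the old-style fdisk column layout shown in the listings above.

```shell
# check_alignment ALIGN_KB
# Reads "fdisk -lu" output on stdin and prints, for every /dev/... partition
# line, whether its start sector is a multiple of ALIGN_KB.
# Assumes 512-byte sectors (hypothetical helper, not part of fdisk).
check_alignment() {
    local align_kb=${1:-64}
    awk -v align_kb="$align_kb" '
        /^\/dev\// {
            # An optional "*" boot flag shifts the Start column by one.
            start = ($2 == "*") ? $3 : $2
            sectors_per_unit = align_kb * 1024 / 512
            status = (start % sectors_per_unit == 0) ? "aligned" : "NOT aligned"
            printf "%s start=%s %s at %s KB\n", $1, start, status, align_kb
        }'
}

# Usage (as root):
#   fdisk -lu | check_alignment 64      # 64 KB stripe width
#   fdisk -lu | check_alignment 1024    # 1 MB, as needed for ASM
```

With the two disks listed earlier, /dev/sde1 (start 63) is reported as not aligned at 64 KB, and /dev/sdf1 (start 128) as aligned at 64 KB but not at 1 MB.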

The following shows how to do this with fdisk AFTER creating a partition and BEFORE creating a file system:

[root@moloch oradata]# fdisk /dev/sde

Command (m for help): p

Disk /dev/sde: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sde1               1      267349  2147480811   83  Linux

Command (m for help): x

Expert command (m for help): b
Partition number (1-4): 1
New beginning of data (63-4294961684, default 63): 128

Expert command (m for help): w

# fdisk -lu

Disk /dev/sde: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders, total 4294967296 sectors
Units = sectors of 1 * 512 = 512 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sde1             128  4294961684  2147480811   83  Linux

Choose a file system

There are tremendous discussions about which file system is the “best” on Linux. For running Oracle on Linux there is only one: ext3. This matches Oracle's recommendations (and my opinion as well).

Another option is, of course, ASM.

Creating the ext3 file system

When creating the file system for your database files you might want to change the number of bytes per created inode. This is especially useful for shortening the periodic full file system check in ext3 (I outlined it in this post).

You can safely create an ext3 file system with only one inode per 1 MB of space with:

mkfs.ext3 -T largefile <device>

Or even one inode per 4 MB with:

mkfs.ext3 -T largefile4 <device>

Remember that you need one inode per created file or directory. So I recommend these options only on file systems intended for data files (this includes redo logs and even archive logs).

Creating a file system with fewer inodes shortens your fsck times dramatically. I ran some tests with an 8 TB ext3 file system created with default options, with “-T largefile”, and with “-T largefile4”. The file system created with default options had one inode per 4 KB (the default ratio comes from your mke2fs configuration and may differ on your system). Running fsck on the 8 TB file system filled to 50% with data files (all 32 GB in size) took approx. 7 hours! Checking the same file system created with “-T largefile” took only 10 minutes; with “-T largefile4” it took approx. 5 minutes.
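The difference in fsck time follows the number of inodes that have to be checked. As a rough sketch (the helper name inode_estimate is made up for this post, and the 4 KB default is the ratio assumed in the text above), the inode counts for an 8 TB file system work out as follows:

```shell
# inode_estimate SIZE_GB BYTES_PER_INODE
# Prints roughly how many inodes mkfs creates for a file system of SIZE_GB
# at the given bytes-per-inode ratio (hypothetical helper for illustration).
inode_estimate() {
    local size_gb=$1 bytes_per_inode=$2
    echo $(( size_gb * 1024 * 1024 * 1024 / bytes_per_inode ))
}

inode_estimate 8192 4096       # 4 KB per inode (default assumed here): 2147483648
inode_estimate 8192 1048576    # -T largefile  (1 MB per inode):        8388608
inode_estimate 8192 4194304    # -T largefile4 (4 MB per inode):        2097152
```

On an existing file system, “tune2fs -l <device>” reports the actual value in its “Inode count” line.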

Disable atime-Updates on ext3

When a file or directory is accessed, ext3 updates its last-accessed timestamp. It does so for every read of every file, which has an unnecessary performance impact. To avoid this you can turn off these updates by adding

noatime,nodiratime

to your /etc/fstab or remounting the file system with:

mount  -o remount,noatime,nodiratime,rw  <path or device>

According to Oracle's recommendations you can safely turn this off in non-RAC environments. In RAC environments you need to check first.
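To verify that the options are actually in effect after a remount, you can look the mount point up in /proc/mounts. The following is a small sketch; the function name mount_has_option and the mount point /u02/oradata are only examples and not anything Oracle or Linux ships.

```shell
# mount_has_option MOUNTPOINT OPTION [MOUNTS_FILE]
# Succeeds if MOUNTPOINT is mounted with OPTION according to MOUNTS_FILE
# (defaults to /proc/mounts). Hypothetical helper for illustration.
mount_has_option() {
    local mountpoint=$1 option=$2 mounts_file=${3:-/proc/mounts}
    awk -v mp="$mountpoint" -v opt="$option" '
        $2 == mp {
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == opt) found = 1
        }
        END { exit(found ? 0 : 1) }' "$mounts_file"
}

# Usage (the mount point is only an example):
#   mount_has_option /u02/oradata noatime && echo "atime updates are off"
```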

Enable Huge Pages

To use your memory optimally I recommend enabling and using Huge Pages as outlined in this post.

You can enable Huge Pages by adding the following to your /etc/sysctl.conf:

vm.nr_hugepages = <number>

<number> is the number of Huge Pages to allocate. Each Huge Page is typically 2 MB in size on x86 systems (check “Hugepagesize” in /proc/meminfo), so a 2 GB SGA needs at least 1024 Huge Pages.
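The calculation is a simple division, rounded up. Here is a minimal sketch of it; the function name hugepages_needed is my own, and the 2048 KB fallback is an assumption for systems where /proc/meminfo cannot be read.

```shell
# hugepages_needed SGA_MB [HUGEPAGE_KB]
# Prints the number of Huge Pages needed to hold an SGA of SGA_MB megabytes,
# rounding up. HUGEPAGE_KB defaults to the Hugepagesize value in
# /proc/meminfo, falling back to 2048 KB if that cannot be read.
hugepages_needed() {
    local sga_mb=$1 hugepage_kb=$2
    if [ -z "$hugepage_kb" ]; then
        hugepage_kb=$(awk '/^Hugepagesize:/ {print $2}' /proc/meminfo 2>/dev/null)
    fi
    hugepage_kb=${hugepage_kb:-2048}
    echo $(( (sga_mb * 1024 + hugepage_kb - 1) / hugepage_kb ))
}

hugepages_needed 2048 2048   # 2 GB SGA with 2 MB pages: prints 1024
```

After the setting is applied, “grep Huge /proc/meminfo” shows how many pages were actually allocated (HugePages_Total) and how many are still free (HugePages_Free).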

More details can be found here.

Enable DirectIO and Asynchronous IO

By setting the parameter “filesystemio_options” in your init.ora you can enable either DirectIO, Asynchronous IO or both.

To enable both (which I recommend on Linux) set:

filesystemio_options=setall

in your init.ora (or spfile).

Note that not all file systems support DirectIO or even Asynchronous IO. ext3 does support both.

Tune Swapping Priority

Linux swaps out unused pages even if there is plenty of free memory. It does so by checking whether a memory page (4 KB in size) was accessed recently. If a page has not been accessed for some time, it gets swapped out to disk. The freed memory is used for other purposes, most often the file system cache.

You can adjust how readily memory pages get swapped out with the “swappiness” value. The valid range is 0…100. “0” means: no swapping unless necessary; “100” means: swap out any page that is not used very frequently; the default is 60.

I recommend setting this value quite low on a database server, e.g. 5 or even 1, by adding to your “/etc/sysctl.conf”:

vm.swappiness=5
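The running kernel exposes the current value in /proc/sys/vm/swappiness, so you can check whether the setting has taken effect. A small sketch follows; the helper name swappiness_ok is made up for illustration.

```shell
# swappiness_ok TARGET [FILE]
# Succeeds if the current swappiness (read from FILE, by default
# /proc/sys/vm/swappiness) is at or below TARGET. Hypothetical helper.
swappiness_ok() {
    local target=$1 file=${2:-/proc/sys/vm/swappiness}
    local current
    current=$(cat "$file") || return 2
    [ "$current" -le "$target" ]
}

# Usage:
#   sysctl -p                  # as root: reload /etc/sysctl.conf without a reboot
#   swappiness_ok 5 && echo "swappiness is at most 5"
```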

Note that Huge Pages are not swappable and thus always remain in memory.

  1. me
    October 28th, 2012 at 06:51 | #1

    noatime is a superset of nodiratime (in other words, you only need to use noatime)

  2. Efstathios Efstathiou
    August 30th, 2013 at 11:32 | #2

    Hi Ronny,

    thanks for this very valuable post. We did run tests with ext4, ext3 and xfs on a 16TB single LUN from a HP 3PAR V400.

    Your findings still hold true for ext4, however the gap between 4k (default) and largefile/largefile4 was not as big as with ext3.

    To knock down the fs, we used dd (“dd if=/dev/urandom of=/dev/mapper/fsck_thick bs=4k count=512”). Of course you could also start a few processes (I recommend mixing buffered and direct I/o ones), send them to the background and reboot the server using sysrq trigger, or when running in a VM like Oracle VM do a “xm destroy”.

    To check the potential impact on random I/O performance (requested by the customer), we ran some tests with fio. These showed that there was no performance improvement or degradation due to the change to largefile/largefile4.

    Regards

    Efstathios

    • Ronny Egner
      September 28th, 2013 at 08:41 | #3

      True. But the time needed to check the filesystem is much shorter with largefile/largefile4….
