
INFO: task blocked for more than 120 seconds.

When running heavy workloads on UEK kernels on systems with a lot of memory, you might see the following errors in /var/log/messages:

 

INFO: task bonnie++:31785 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bonnie++      D ffff810009004420     0 31785  11051               11096 (NOTLB)
ffff81021c771aa8 0000000000000082 ffff81103e62ccc0 ffffffff88031cb3
ffff810ac94cd6c0 0000000000000007 ffff810220347820 ffffffff80310b60
00016803dfd77991 00000000001312ee ffff810220347a08 0000000000000001
Call Trace:
[<ffffffff88031cb3>] :jbd:do_get_write_access+0x4f9/0x530
[<ffffffff800ce675>] zone_statistics+0x3e/0x6d
[<ffffffff88032002>] :jbd:start_this_handle+0x2e5/0x36c
[<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
[<ffffffff88032152>] :jbd:journal_start+0xc9/0x100
[<ffffffff88050362>] :ext3:ext3_write_begin+0x9a/0x1cc
[<ffffffff8000fda3>] generic_file_buffered_write+0x14b/0x675
[<ffffffff80016679>] __generic_file_aio_write_nolock+0x369/0x3b6
[<ffffffff80021850>] generic_file_aio_write+0x65/0xc1
[<ffffffff8804c1b6>] :ext3:ext3_file_write+0x16/0x91
[<ffffffff800182df>] do_sync_write+0xc7/0x104
[<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
[<ffffffff80062ff0>] thread_return+0x62/0xfe
[<ffffffff80016a81>] vfs_write+0xce/0x174
[<ffffffff80017339>] sys_write+0x45/0x6e
[<ffffffff8005d28d>] tracesys+0xd5/0xe0

This is a known bug. By default, Linux uses up to 40% of the available memory for file system caching. Once this mark is reached, the file system flushes all outstanding data to disk, causing all subsequent I/O to become synchronous. By default there is a time limit of 120 seconds for flushing this data to disk. In the case shown here, the I/O subsystem is not fast enough to flush the data within 120 seconds. This happens especially on systems with a lot of memory.
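
To put numbers on it (purely illustrative): on a server with 512 GB of RAM, a 40% threshold lets roughly 200 GB of dirty pages accumulate. Even a storage subsystem sustaining 500 MB/s would need about 400 seconds to flush that backlog, well beyond the 120-second limit.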

The problem is solved in later kernels and there is no “fix” from Oracle. I fixed this by lowering the threshold for flushing the cache from 40% to 10% by setting “vm.dirty_ratio=10” in /etc/sysctl.conf, as sketched below. This setting does not influence overall database performance, since you hopefully use Direct I/O and bypass the file system cache completely.
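
A minimal sketch of that change (the 10% value comes from the text above; pick a percentage that matches your storage speed):

echo "vm.dirty_ratio=10" >> /etc/sysctl.conf    # persist across reboots
sysctl -p                                       # activate it immediately
sysctl vm.dirty_ratio                           # verify the running value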

  1. June 20th, 2012 at 07:24 | #1

    Thank you very much. It really helped me.

  2. Maurizio Marini
    August 2nd, 2012 at 08:20 | #2

    Doing “echo 0 > /proc/sys/kernel/hung_task_timeout_secs” only disables the message, doesn’t it?
    It may even make things worse, I would argue.
    What do you suggest about that?

    • Ronny Egner
      August 23rd, 2012 at 10:42 | #3

      Yes, it does, but the message serves a purpose, so disabling it is not a good idea.
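
      For reference, a small sketch of inspecting or raising the timeout instead of disabling it outright (the 300-second value is only an illustration):

      # show the current hung-task timeout (120 seconds by default)
      cat /proc/sys/kernel/hung_task_timeout_secs
      # raise the timeout rather than disabling the check entirely
      echo 300 > /proc/sys/kernel/hung_task_timeout_secs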

  3. Andreas Buschka
    September 14th, 2012 at 10:02 | #4

    Thank you, this solved my problem on a busy Enterprise Manager Cloud Control 12c2 + Oracle Database 11.2.0.3 system. You are a life saver!

  4. November 5th, 2013 at 10:34 | #5

    Thank you very much!

  5. Alex
    November 15th, 2013 at 09:56 | #6

    Thanks! This works :-)

  6. Karti
    July 24th, 2014 at 14:01 | #7

    Can you please mention the exact kernel version in which this bug is fixed?

    • Ronny Egner
      August 5th, 2014 at 10:01 | #8

      Hi Karti,

      this is not really a bug. It basically says that the kernel is blocked writing data to disk.

      The solution implemented in newer kernels is to lower the “dirty_ratio” parameter from 40% to a more realistic value like 5.

      You can implement that yourself by setting the following value in your /etc/sysctl.conf:

      vm.dirty_ratio=5
      # 5 percent of total memory

      and activate it with “sysctl -p”.
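
      If you only want to test the effect first, the same value can be set at runtime without editing any file (it will not survive a reboot):

      sysctl -w vm.dirty_ratio=5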

  7. oldhorizon
    October 30th, 2014 at 11:19 | #9

    Hi Ronny,

    I am facing this issue on the 2.6.39-300.17.3.el6uek.x86_64 kernel. The default dirty_ratio is

    vm.dirty_ratio = 20

    Would it help if I reduced it to 10 or 5?

    • Ronny Egner
      November 19th, 2014 at 14:12 | #10

      Hi,

      yes, very likely. You also want to use Direct I/O wherever possible to bypass the file system cache.
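
      A quick way to watch whether dirty data is actually piling up while the workload runs (a side note, not part of the original reply):

      # dirty and writeback page counters, refreshed every second
      watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'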

  8. Michael Lush
    February 10th, 2015 at 14:56 | #11

    When you said “the problem is solved in later kernels” back in 2011, how much later did you mean? We are on Ubuntu 12.04 with kernel 3.8 and are having this issue.

    • Ronny Egner
      February 17th, 2015 at 14:36 | #12

      Then you either want to use Direct I/O or lower the value of vm.dirty_ratio (e.g. to 1) so there is less data to flush…

  9. Martin
    March 3rd, 2015 at 11:25 | #13

    What do you mean in detail by Direct I/O? Mounting the file systems with it? I am having the same problems on Debian + VMware during backups of the databases (once per month or less). I thought Oracle was good enough to open its own files with direct I/O?! Do you know an article or something where I can read about it?

    Thanks anyway. I will try to reduce dirty_ratio; maybe that’s enough.

    • Ronny Egner
      July 29th, 2016 at 07:07 | #14

      You have to activate Direct I/O in the database with a database parameter. On Linux there is no “forcedirectio” mount option as there is on Solaris.
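
      Assuming an Oracle database is meant here, the parameter in question is presumably FILESYSTEMIO_OPTIONS; a sketch of enabling direct (and asynchronous) I/O, which takes effect after an instance restart:

      # run as the oracle user with SYSDBA access
      echo "ALTER SYSTEM SET filesystemio_options = SETALL SCOPE = SPFILE;" | sqlplus / as sysdba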

  10. Kamal
    June 23rd, 2015 at 10:14 | #15

    Hi,

    We are using RHEL 6.2 32-bit (physical Dell server) and the kernel is 2.6.32-220.el6.i686. We face the same kind of issues on this server. I have checked /etc/sysctl.conf; it doesn’t contain any “vm.dirty_ratio” line. Please let me know what the workaround is.

    • Ronny Egner
      July 29th, 2016 at 07:06 | #16

      Hi,

      just put the line in and activate it. That’s all.

  11. Tony Godshall
    June 23rd, 2015 at 20:46 | #17

    I’ve seen this as recently as 3.16 kernels. It has to do with heavy write loads, not databases in particular. In this case I triggered it by dd’ing in 256-megabyte chunks.

    # grep . /etc/issue
    Debian GNU/Linux 8 \n \l
    # uname -a
    Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt9-3~deb8u1 (2015-04-24) x86_64 GNU/Linux

    It occurs under heavy write load. Last time I saw this, I fixed it by changing the I/O scheduler from cfq to deadline.

    # for f in /sys/devices/*/*/*/*/*/*/block/sda/queue/scheduler; do echo deadline > $f; done

    And I added a script to that effect to my rc.local; a sketch follows below.

    Tony
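
    A sketch of making that persistent, using the canonical sysfs path instead of the long device glob (sda is an assumption; repeat for each relevant disk):

    # e.g. in /etc/rc.local: switch sda to the deadline scheduler at boot
    echo deadline > /sys/block/sda/queue/scheduler
    # verify: the active scheduler is shown in brackets
    cat /sys/block/sda/queue/scheduler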

  12. Arnold
    April 26th, 2017 at 06:16 | #18

    After I edited /etc/sysctl.conf and rebooted, it says: Can’t connect to default. Skipping. Waiting for shutdown of managed resources…
