PRVF-5300: Failed to retrieve active version for CRS on this node when installing 11.2.0.2 DB on 11.2.0.3.0 Grid Infrastructure

I just played with the 11.2.0.3.0 patch set on Linux x86_64 (in my test case Oracle Enterprise Linux 5.6) and tried to install an 11.2.0.2.0 database on top of it. It fails with:

PRVF-5300: Failed to retrieve active version for CRS on this node


The error stack in the installation log is:

ID: oracle.install.commons.util.exception.DefaultErrorAdvisor:745
oracle.cluster.verification.VerificationException: An internal error occurred within cluster
verification framework

ERRORMSG(linux): PRVF-5300 : Failed to retrieve active version for CRS on this node
        at oracle.cluster.verification.ClusterVerification.getPreReqTasksForSIDBInst(ClusterVerification.java:615)
        at oracle.install.ivw.db.action.PrereqAction.getProductVerificationTasks(PrereqAction.java:111)
        at oracle.install.commons.base.interview.common.action.AbstractPrereqAction.execute
        (AbstractPrereqAction.java:86)
        at oracle.install.commons.flow.AbstractFlowExecutor.startAction(AbstractFlowExecutor.java:358)
        at oracle.install.commons.flow.AbstractFlowExecutor.enterVertex(AbstractFlowExecutor.java:571)
        at oracle.install.commons.flow.AbstractFlowExecutor.transition(AbstractFlowExecutor.java:333)
        at oracle.install.commons.flow.AbstractFlowExecutor.nextState(AbstractFlowExecutor.java:268)
        at oracle.install.commons.flow.AbstractFlowExecutor.nextViewState(AbstractFlowExecutor.java:227)
        at oracle.install.commons.flow.DefaultFlowNavigator.goForward(DefaultFlowNavigator.java:58)
        at oracle.install.commons.flow.jewt.FlowWizard$1.run(FlowWizard.java:125)
        at oracle.install.commons.flow.jewt.FlowWizard$TransitionManager$1.run(FlowWizard.java:101)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
        at java.util.concurrent.FutureTask.run(FutureTask.java:123)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:676)
        at java.lang.Thread.run(Thread.java:595)

The problem

I started the installer with debugging enabled:

./runInstaller -debug -logLevel finest > inst1.out 2> inst2.out

The log files gave some insight:

[Version.getVersion:497]  version String is 11.2.0.3.0
[Version.getVersion:498]  new Version().toString is 11.2.0.2.0
[VerificationUtil.getSIHAReleaseVersionObj:4986]  Configuration Exception:
PRKC-1137 : Unable to find Version object with string value 11.2.0.3.0
[VerificationUtil.getCRSUser:1362]  Active Version = null

The related query command is:

"GI_HOME/bin/crsctl query has releaseversion"

Obviously the 11.2.0.2.0 installer cannot parse the version string “11.2.0.3.0”.

Solution #1

The simplest approach is to start the installer like this:

./runInstaller -ignorePrereq

With that flag the installer skips all pre-installation checks.

Solution #2

Another approach is to create a wrapper around crsctl that reports version 11.2.0.2.0 when releaseversion is queried:

cd $GRID_HOME/bin
mv crsctl crsctl.orig

Now create a script “crsctl” with the following contents:

#!/bin/sh
# Wrapper around the renamed crsctl: fake the release version for the
# 11.2.0.2.0 installer, pass everything else through to the real binary.
EXEC=/u01/app/oragrid/product/11.2.0.3.0/bin/crsctl.orig

case $1 in
query)
        echo "Oracle High Availability Services release version on the local node is [11.2.0.2.0]"
        ;;
*)
        $EXEC "$@"
        ;;
esac
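Make the wrapper executable and give it a quick test. Since it answers any “query” subcommand with the faked message, the output should look like this:

chmod +x $GRID_HOME/bin/crsctl
$GRID_HOME/bin/crsctl query has releaseversion
Oracle High Availability Services release version on the local node is [11.2.0.2.0]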

 

Now you can start the database installation. During the verification steps the installer might report the Oracle Restart registry as invalid; just ignore it. The installation should run fine.

Note that this bug is NOT related to OEL 5.6. It is the installer which cannot deal with the version string of the newer Grid Infrastructure, so you will face this error on OEL 6, Red Hat and SuSE as well.

Don't forget to revert the changes after the installation!

After the installation finished I was able to create a database using ASM without any problems. Registering the database with Oracle Restart also worked fine.
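For reference, a minimal manual registration with Oracle Restart looks roughly like this; the database name DB112 and the Oracle home path are placeholders, not taken from my setup:

srvctl add database -d DB112 -o /u01/app/oracle/product/11.2.0.2/dbhome_1
srvctl start database -d DB112
srvctl config database -d DB112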

INFO: task blocked for more than 120 seconds.

When running high workloads on UEK kernels on systems with a lot of memory, you might see the following errors in /var/log/messages:

 

INFO: task bonnie++:31785 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bonnie++      D ffff810009004420     0 31785  11051               11096 (NOTLB)
ffff81021c771aa8 0000000000000082 ffff81103e62ccc0 ffffffff88031cb3
ffff810ac94cd6c0 0000000000000007 ffff810220347820 ffffffff80310b60
00016803dfd77991 00000000001312ee ffff810220347a08 0000000000000001
Call Trace:
[<ffffffff88031cb3>] :jbd:do_get_write_access+0x4f9/0x530
[<ffffffff800ce675>] zone_statistics+0x3e/0x6d
[<ffffffff88032002>] :jbd:start_this_handle+0x2e5/0x36c
[<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
[<ffffffff88032152>] :jbd:journal_start+0xc9/0x100
[<ffffffff88050362>] :ext3:ext3_write_begin+0x9a/0x1cc
[<ffffffff8000fda3>] generic_file_buffered_write+0x14b/0x675
[<ffffffff80016679>] __generic_file_aio_write_nolock+0x369/0x3b6
[<ffffffff80021850>] generic_file_aio_write+0x65/0xc1
[<ffffffff8804c1b6>] :ext3:ext3_file_write+0x16/0x91
[<ffffffff800182df>] do_sync_write+0xc7/0x104
[<ffffffff800a28b4>] autoremove_wake_function+0x0/0x2e
[<ffffffff80062ff0>] thread_return+0x62/0xfe
[<ffffffff80016a81>] vfs_write+0xce/0x174
[<ffffffff80017339>] sys_write+0x45/0x6e
[<ffffffff8005d28d>] tracesys+0xd5/0xe0

This is a known bug. By default Linux uses up to 40% of the available memory for file system caching. Once this mark is reached, the file system flushes all outstanding data to disk and all subsequent I/Os become synchronous. Flushing this data to disk has a default time limit of 120 seconds. In this case the I/O subsystem is simply not fast enough to flush the data within 120 seconds. This especially happens on systems with a lot of memory.

The problem is solved in later kernels, and there is no “fix” from Oracle. I fixed this by lowering the threshold for flushing the cache from 40% to 10%, setting “vm.dirty_ratio=10” in /etc/sysctl.conf. This setting does not influence overall database performance since you hopefully use direct I/O and bypass the file system cache completely anyway.
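As a minimal sketch, the change amounts to this (applied as root; the last command just reads the value back to verify it):

# /etc/sysctl.conf -- lower the dirty page threshold from the 40% default
vm.dirty_ratio = 10

# apply without a reboot and verify
sysctl -p
cat /proc/sys/vm/dirty_ratio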

ext4 file systems and the 16 TB limit – how to *solve* it

File systems do have limits. That's no surprise. ext3 had a limit of 16 TB for the file system size. If you needed more space, you'd have to use another file system, for instance XFS or JFS, or split the capacity into multiple mount points.

ext4 was designed to allow far larger file systems than ext3. According to Wikipedia, ext4 has a maximum file system size of 1 EiB (approx. one exabyte or 1024 PB or 1024*1024 TB).

Now if you try to create one single large file system with ext4 on any Linux distribution out there (including OEL 6.1, as of 18th August 2011), you will end up with:

[root@localhost ~]# mkfs.ext4 /dev/iscsi/test
mke4fs 1.41.9 (22-Aug-2009)
mkfs.ext4: Size of device /dev/iscsi/test too big to be expressed in 32 bits using a blocksize of 4096.
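The 16 TB figure follows directly from the error message: with 32-bit block numbers and a block size of 4096 bytes, the addressable capacity is 2^32 * 4096 bytes = 2^44 bytes = 16 TiB.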

This post is about how to solve the issue.


When patching is not enough: Oracle 11g R2 on Solaris SPARC requires fresh base installation of Solaris 10 U6

While checking MOS I found an interesting note (ID 964976.1) which states:

Applying a kernel patch or a Solaris patch bundle is not the equivalent to
installing the specific Solaris 10 "update 6" image. 11gR2 RDBMS software
is only certified for a base install image of Solaris 10 update 6 or greater.

There is a FAQ (ID 971464.1) on this problem, which states:

Oracle/Sun has specifically stated that "installing patches will not bring it to Update 6".

and

It is only certified for a base install image of Solaris 10 Update 6 or greater, or an
upgraded image of an earlier Solaris 10 update to at least Update 6 or greater. There are
only two methods to accomplish this "image". Please see Question #9 for more details.

So keep this in mind when installing 11g R2 on Solaris SPARC.

Installing a paravirtualized guest using PXE and Kickstart on Oracle VM 2.2

Over the past few days I started working with Oracle VM 2.2.1.

The task I tried to accomplish was to install Oracle Enterprise Linux in a paravirtualized guest via PXE and Kickstart. Doesn't sound too complicated, does it? If you are familiar with VMware and know how easy this task is there, be advised: with Oracle VM it is complicated.


11.2.0.2: two critical bugs

Just found two nice bugs in Metalink for 11.2.0.2.x:

Bug #1 (10205230) [Note ID 1318986.1]: ORA-00600 or DATA CORRUPTION in RAC Environments
when using shutdown mode "normal", "transactional" or "immediate" on 11.2.0.2.1 and 11.2.0.2.0.

 

This bug is fixed in 11.2.0.2 PSU 2.

 

Now if you install 11.2.0.2 PSU 2 you might find the next major bug:

Bug #2 (12431716): Mutex waits may cause higher CPU usage in 11.2.0.2.2 PSU / GI PSU [ID 12431716.8]

Removing NFS Locks on OpenSolaris / Nexenta (or handling ORA-27086: unable to lock file – already in use when using Oracle over NFS)

Facts

  • Oracle running on Linux / Solaris
  • Oracle data files and control files are stored on NFS
  • The NFS server runs OpenSolaris or Nexenta

Symptom

ORA-00205: error in identifying controlfile, check alert log for more info
ORA-00210: cannot open the specified  control file
ORA-00202: control file:: '/u02/oradata/ORA11P/control02.ctl'
ORA-27086: unable to lock file - already in use
Linux-x86_64 Error: 11: Resource temporarily unavailable
Additional information: 8

Cause

Due to a database crash or an unclean shutdown, NFS locks were not properly released on the storage side. You have to clear them manually in order to be able to start up the database again.

Solution

Step 1 – Remove NFS locks on the OpenSolaris / Nexenta side

root@nex2:/volumes# clear_locks oracle11
Clearing locks held for NFS client oracle11 on server nex2
clear of locks held for oracle11 on nex2 returned success

Note: It is NOT sufficient to enter the IP address. You have to use the HOSTNAME here.

Step 2 – Mount & Open the database

oracle@oracle11:/u02/oradata/LIMSTEST> sqlplus / as sysdba

SQL*Plus: Release 11.2.0.2.0 Production on Thu Apr 28 20:53:19 2011
Copyright (c) 1982, 2010, Oracle.  All rights reserved.
Connected to:
Oracle Database 11g Release 11.2.0.2.0 - 64bit Production

SQL> alter database mount;
Database altered.
SQL> alter database open;
Database altered.

ORA-27154: post/wait create failed / ORA-27301: OS failure message: No space left on device when starting ASM or database instance

Today I came across a very strange error. After rebooting one cluster node (which had run flawlessly before!) the ASM instance came up fine but the database instance failed with:

ORA-27154: post/wait create failed
ORA-27300: OS system dependent operation:semget failed with status: 28
ORA-27301: OS failure message: No space left on device
ORA-27302: failure occurred at: sskgpsemsper

It turned out the kernel settings were insufficient: the semaphore settings caused problems.

In /etc/sysctl.conf:

kernel.sem = 250 32000 100 128

This line needs to be changed to:

kernel.sem = 250 32000 100 256

Note the change of the last number (SEMMNI, the maximum number of semaphore sets) from 128 to 256. After that, the settings are applied as root with:

sysctl -p

After that all instances came up just fine.
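To double-check the result, the active limits can be read back and the allocated semaphore sets listed; the output will of course vary per system:

# active limits in the order SEMMSL SEMMNS SEMOPM SEMMNI
cat /proc/sys/kernel/sem

# semaphore sets currently allocated (Oracle allocates sets at instance startup)
ipcs -s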

PSU Bundle Patch 3 for 11.2.0.2.0

Bundle Patch 3 for Oracle 11.2.0.2.0 is available.

For installation refer to these posts here and here.

Patch number is: 10387939

If you're worried about the fact that the patch says it is for Exadata, read this post from Oracle.

Money Quote:

Officially this Bundle Patch for Oracle Database 11.2.0.2 is titled “Exadata Database recommended patch” and got released yesterday. But I would recommend this one to all customers using 11.2.0.2 Grid Infrastructure, RAC and ASM.