Ivan Kartik - Oracle and Linux Blog - Just in case of you will be facing this...

We hit a nasty bug on 10.2.0.4 RAC on HP-UX 11.23 (Itanium). From time to time the number of open files reaches the system limit (defined by nfile). You can check the current number of open files with the "glance" command ("system tables report" view) or with the "lsof" command, which gives more detailed output. According to the lsof output, the racgimon process holds a large number of open file descriptors on

$ORACLE_HOME/dbs/hc_.dat

and this number increases every 60 seconds. At the same time, a new error message is written to the

$ORACLE_HOME/log/dwh1/racg/imon_.log

file. Here is the text of the error message:


2009-01-15 22:24:13.124: [    RACG][15][28099][15][ora...inst]:
GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13
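To see which files racgimon is piling descriptors onto, the lsof output can be grouped by file name. This is only a minimal sketch: the function name count_fds_by_name is made up for illustration, and it assumes the file name is the last column of the lsof output and contains no spaces.

```shell
#!/bin/sh
# Hypothetical helper (not part of lsof): reads lsof output on stdin and
# prints "<count> <file name>" for every file, most frequently opened first.
count_fds_by_name() {
    # NR > 1 skips the lsof header line; $NF is the NAME column
    awk 'NR > 1 { count[$NF]++ }
         END    { for (name in count) print count[name], name }' |
        sort -rn
}
```

Typical use would be something like `/usr/local/bin/lsof -c racgimon | count_fds_by_name | head -5`, run twice a few minutes apart to compare the counts.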

Is there a patch? Yes, this bug is known (see Doc ID 739557.1), and of course there is a patch for it: #7298531. So why this post? The answer is simple: under some circumstances the patch may not work. If you try to apply it under those circumstances, CRS will not start after the patch has been applied. Rolling the patch back is also not possible without tweaking the prerootpatch.sh script, because that script expects a correctly running CRS. In our case we are still waiting for a patch that works in our environment.

Is there any workaround? Yes, at least two workarounds are possible.

1. Racgimon killer

The RAC Global Instance Monitor, aka the racgimon process, is responsible for the clusterwide health check. If this process dies, it is respawned automatically. With this workaround we set a threshold for the number of file descriptors opened by racgimon; once the threshold is reached, racgimon is killed and all of its file descriptors are closed. All we need is a shell script scheduled in cron (one execution per day is quite sufficient).


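#!/usr/bin/bash

FD_THRESHOLD=20000          # Threshold for file descriptors opened by racgimon
LSOF=/usr/local/bin/lsof    # Location of the lsof command

# Number of lsof output lines for racgimon (includes one header line)
FD_CURRENT=$("$LSOF" -c racgimon | wc -l)
# PIDs of all running racgimon processes
RACGIMON_PIDS=$(ps -aef | grep racgimon | grep -v grep | awk '{print $2}')

# Kill racgimon once the threshold is exceeded; it is respawned automatically
if [ "${FD_CURRENT:-0}" -gt "$FD_THRESHOLD" ]; then
    for pid in $RACGIMON_PIDS; do
        kill -9 "$pid"
    done
fi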
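The script above sums the descriptors of all racgimon processes and kills every one of them. A possible variation, sketched here under the assumption that the PID is the second column of the lsof output, is to pick only the processes that are over the limit. The function name pids_over_threshold is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads lsof output on stdin and prints every PID
# that has more than $1 lines (open descriptors) in that output.
pids_over_threshold() {
    awk -v limit="$1" '
        NR > 1 { n[$2]++ }    # skip the header line, count lines per PID
        END    { for (pid in n) if (n[pid] > limit + 0) print pid }'
}
```

It could be fed with `/usr/local/bin/lsof -c racgimon | pids_over_threshold 20000`, and the printed PIDs passed to kill in the same way as in the original loop.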

2. Remove instances from CRS applications

Stop the running instance and execute the following command:


   srvctl remove instance -d DBNAME -i INSTANCE1

Start the instance. Repeat these steps for the other instances in the cluster. Note: From this moment on, you can't use "srvctl" to start your instances, they will not start up automatically after a reboot, and they will disappear from the "crs_stat" output.
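For a cluster with several instances, the per-instance steps can be written out first and reviewed before anything is executed. The sketch below only prints commands and runs nothing. The function name print_removal_steps is hypothetical, and stopping the instance with "srvctl stop instance" is an assumption, since the text above does not say how the instance is stopped.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the commands for each instance.
# Usage: print_removal_steps DBNAME INSTANCE1 [INSTANCE2 ...]
print_removal_steps() {
    db=$1
    shift
    for inst in "$@"; do
        # Assumption: the instance is stopped through srvctl
        echo "srvctl stop instance -d $db -i $inst"
        echo "srvctl remove instance -d $db -i $inst"
        echo "# now start $inst manually on its node (srvctl no longer manages it)"
    done
}
```

Calling `print_removal_steps DBNAME INSTANCE1 INSTANCE2` prints three lines per instance, which can be checked and then run by hand one node at a time.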

Comments

hola, 26 February, 2009: Thank you for your help, it has been really effective in my work.

george wang, 26 March, 2009: If I kill the racgimon process, will it crash the DB or CRS? Thank you.

Ivan, 27 March, 2009: No, DB and CRS remain up & running.

wolfee, 02 April, 2009: I'm having a similar problem on RHEL AS 4 u7, kernel 2.6.9-67.0.15.ELsmp. The only difference I can see between my case and everything else I've found is that these posts reference "Not enough space" where my error states "Invalid argument". I know Oracle support is going to suggest this patch, and after reading this, I don't know if I want to go that route. I also don't want to remove the instances or kill racgimon nightly. Very frustrating at this point. I have to bring the database down every four days to prevent a crash. Do any other open files spawn because of this? Node2 is showing 131 open files for racgimon but over 50k open files on the server according to "cat /proc/sys/fs/file-nr". Node1 is showing over 7k open files for racgimon, but over 40k open files on the server. There are about 8700 datafiles in the database, so there should be about 11k open files on the server.
