Just in case you run into this...
22 January, 2009
We hit a nasty bug on 10.2.0.4 RAC on HP-UX 11.23 (Itanium). From time to time the number of open files reaches the system limit (defined by nfile). You can check the current number of open files with the "glance" command ("system tables report" view) or with the "lsof" command, which provides more detailed output. According to the lsof output, the racgimon process holds a large number of open file descriptors on
$ORACLE_HOME/dbs/hc_.dat
and this number increases every 60 seconds. At the same time, a new error message is written to the $ORACLE_HOME/log/dwh1/racg/imon_.log
file.
Here is the text of the error message:
2009-01-15 22:24:13.124: [ RACG][15][28099][15][ora...inst]:
GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13
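To confirm that you are hitting the same leak, you can count how many of racgimon's open descriptors point at a health-check file and watch whether the number grows from minute to minute. The following is only a minimal sketch: the function name count_hc_fds is made up for this post, and it assumes lsof prints the file path somewhere on each output line.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Oracle tooling).
# Reads lsof output on stdin and prints the number of lines that refer
# to a health-check file, i.e. a path matching .../dbs/hc_*.dat.
count_hc_fds() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status clean in that case.
    grep -c '/dbs/hc_.*\.dat' || true
}
```

Typical use would be something like "lsof -c racgimon | count_hc_fds", run a few times one minute apart. A steadily increasing number matches the behaviour described above.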
Is there a patch? Yes, this bug is known (see Doc ID 739557.1) and of course there is a patch for this bug: #7298531.
Now you may ask: so why write this post?
The answer is pretty simple: under some circumstances this patch may not work.
In those circumstances, CRS will not start after the patch has been applied.
Rolling back the patch is also not possible without tweaking the prerootpatch.sh script, because this script expects CRS to be running correctly.
In our case, we are still waiting for a working patch for our environment.
Is there any workaround?
Yes, at least two workarounds are possible.
1. Racgimon killer
RAC Global Instance Monitor, aka the racgimon process, is responsible for the clusterwide health check. If this process dies, it is respawned automatically.
With this workaround we set a threshold for the file descriptors opened by racgimon. When the limit is reached, racgimon is killed, which closes all file descriptors it holds. All we need is a shell script scheduled in cron (one execution per day is quite sufficient).
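#!/usr/bin/bash
FD_THRESHOLD=20000        # Threshold for file descriptors opened by racgimon
LSOF=/usr/local/bin/lsof  # Location of the lsof command
# Number of lsof output lines for all racgimon processes (includes one header line)
FD_CURRENT=$("$LSOF" -c racgimon | wc -l)
# PIDs of all running racgimon processes
RACGIMON_PIDS=$(ps -aef | grep racgimon | grep -v grep | awk '{print $2}')
if [ $FD_CURRENT -gt $FD_THRESHOLD ]; then
    # racgimon is respawned automatically after being killed
    for pid in $RACGIMON_PIDS; do
        kill -9 "$pid"
    done
fi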
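The script above compares the combined lsof line count of all racgimon processes against the threshold and then kills every racgimon it finds. If more than one racgimon runs on a node, a possible refinement is to count open files per PID and kill only the offenders. This is a sketch, not something we run in production: the function name pids_over_threshold is made up, and it assumes lsof's default output with one header line and the PID in the second column.

```shell
#!/bin/sh
# Hypothetical helper: reads lsof output on stdin and prints the PID of
# every process that holds more than $1 open files.
# Assumes the default lsof layout: header on line 1, PID in column 2.
pids_over_threshold() {
    awk -v limit="$1" '
        NR > 1 { n[$2]++ }                                  # skip header, count lines per PID
        END    { for (p in n) if (n[p] > limit) print p }   # report PIDs over the limit
    '
}
```

It could be used as "lsof -c racgimon | pids_over_threshold 20000", with the resulting PIDs passed to kill. Either variant can then be scheduled with a daily crontab entry pointing at wherever you saved the script.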
2. Remove instances from CRS applications
Stop the running instance and execute the following command:
srvctl remove instance -d DBNAME -i INSTANCE1
Start the instance.
Repeat these steps for the other instances in the cluster.
Note: From this moment on you can't use "srvctl" to start your instances, your instances will not start automatically after a reboot, and the instance(s) will disappear from the "crs_stat" output.
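To double-check that an instance really is gone from CRS after the removal, you can search the "crs_stat" output for its resource name. The sketch below rests on an assumption: that instance resources are named ora.DBNAME.INSTANCE.inst and that plain "crs_stat" (without -t, which shortens names) prints them in full. The function name instance_registered is made up for this post.

```shell
#!/bin/sh
# Hypothetical helper: reads crs_stat output on stdin and succeeds (exit 0)
# if a resource for the given instance is still listed.
# $1 = database name, $2 = instance name
# Assumption: resources are named ora.<db>.<instance>.inst
instance_registered() {
    grep -q "ora\.$1\.$2\.inst"
}
```

For example, "crs_stat | instance_registered DBNAME INSTANCE1 && echo still registered" should print nothing once the instance has been removed.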