Log event for server failover

We have Ignition (ver. 7.2.6) running in a redundant configuration. For some reason the primary server has been “hanging” and the backup takes over. The OS (Ubuntu Linux 11.04) is still operational: we can ping the server and access other applications. Looking at the running processes, it appears that the Java process associated with the Ignition application has crashed.

First question: has anyone seen this before?

The second relates to the Ignition server log events. We’re trying to determine exactly when this is happening. Unfortunately, the log contains so many events that we need to search for the specific event where the backup can’t contact the primary and takes over. We’ve tried a couple of queries. Does anyone know what the log event syntax is for the backup assuming control?

Thanks

There are a couple of different things you could search for in either the master or backup wrapper.log files:

In the backup node wrapper.log file try:
“role=Backup, activity level=Active” or
“Server has likely become unavailable”

In the master node wrapper.log try:
“Error occured while communicating with backup client”

These phrases should get you rather close in the log files to when the master and backup nodes are losing communication with each other.
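
If it helps with a large log, here is a minimal Python sketch of that search. It only does plain substring matching on the phrases quoted above (without the surrounding quote marks). The function names and the example path are hypothetical, and the exact wording of the log messages may differ between Ignition versions, so treat the phrase lists as a starting point.

```python
# Phrases copied from the reply above, matched as plain substrings.
# The spelling "occured" is kept as quoted, since the search has to
# match whatever text the log actually contains.
BACKUP_PHRASES = [
    "role=Backup, activity level=Active",
    "Server has likely become unavailable",
]
MASTER_PHRASES = [
    "Error occured while communicating with backup client",
]


def find_events(lines, phrases):
    """Return (line_number, text) pairs for lines containing any phrase."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(phrase in line for phrase in phrases):
            hits.append((number, line.rstrip("\n")))
    return hits


def search_log(path, phrases):
    """Scan one log file line by line; undecodable bytes are replaced."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_events(handle, phrases)
```

For example, calling search_log("/path/to/wrapper.log", BACKUP_PHRASES) on the backup node (the path is a placeholder for wherever your install keeps its logs) returns the matching lines with their line numbers, so you can jump to that part of the file and read the surrounding entries.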

There are numerous reasons this could be happening, but they’d all just be speculation at this point. A couple of questions:

Does the backup only come up for a couple of seconds and then go back down again or does the backup come up and continue to run? Is the master node really failing and going down completely or does it seem to fix itself rather quickly?

Thanks, that’s what I was looking for.

Dave, I am seeing the case you described, where the backup seems to take control for a few seconds and then control appears to resume on the primary. What does this indicate? We do seem to be having an issue with heap memory on this server, and I’m currently awaiting an opportunity to switch the primary server to the G1 garbage collector (G1GC) as an experiment.