Issue with Redundancy

Using V7.5.7
We have an application at a customer’s site that has redundancy enabled. After a power failure on May 11th, the customer complained that hourly logged transaction group data was now showing up twice at the top of each hour. PCs are protected by UPS, but the power outage was longer than the UPS backup time.
Since this is an out-of-town job, I had issues with their IT department getting my remote desktop connection re-established. I finally got online today and noticed that data was logged at the top of the hour, and then another transaction was logged about 1 minute 45 seconds later.
I eventually traced the problem to the Backup, which was running and logging data to the database as well, even though the Master was running.
Checking the status on the Master, only the master node was displayed; checking the status on the Backup, only the backup node was displayed.
I had to log on to the Backup and toggle its mode from Backup → Independent → Backup in order to stop it from logging to the database and to have the Master → Backup status show up in Status.
The attached snapshot shows the logged data and where I corrected the issue just before 16:00.

How does the backup node detect the master after a power failure? I have the Master Recovery set to Automatic.


My question would be: what happens if the backup powers up first and runs the project as backup, but with no connection to the master, and then the master powers up?

It appears that this happened in my case, and the master and the backup were both running and logging transaction group data to the database until I toggled the backup.
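
For anyone wanting to confirm the same symptom from the logged data, here is a minimal Python sketch that flags hours containing more than one logged row. It assumes the row timestamps have already been read out of the transaction group table into a list; the function name and the sample values are made up for illustration and are not taken from the actual database.

```python
from collections import defaultdict
from datetime import datetime


def hours_with_duplicate_logs(timestamps):
    """Return {hour_start: [timestamps]} for every hour holding more than one row."""
    by_hour = defaultdict(list)
    for ts in timestamps:
        # Truncate each timestamp to the start of its hour to group the rows.
        hour_start = ts.replace(minute=0, second=0, microsecond=0)
        by_hour[hour_start].append(ts)
    return {hour: sorted(rows) for hour, rows in by_hour.items() if len(rows) > 1}


# Made-up sample: two hours logged twice (about 1 min 45 s apart), one logged once.
sample = [
    datetime(2000, 1, 1, 14, 0, 0),
    datetime(2000, 1, 1, 14, 1, 45),
    datetime(2000, 1, 1, 15, 0, 0),
    datetime(2000, 1, 1, 15, 1, 45),
    datetime(2000, 1, 1, 16, 0, 0),
]

for hour, rows in sorted(hours_with_duplicate_logs(sample).items()):
    print(hour, "has", len(rows), "rows")
```

With an hourly transaction group, any hour reported here would suggest that a second node was logging alongside the first.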

Hi,

It sounds like, for some reason, the backup wasn’t trying to reconnect to the master. It should try to connect according to the “reconnect period” defined in the redundancy settings on the backup. The other possibility is that it was connecting, but then some error was occurring that was preventing it from finishing.

Normally, it works like this: when the backup starts up, it is not immediately active. It starts trying to connect to the master. When the “Startup connection allowance” expires, it will become active, if it hasn’t successfully connected.

After that, it tries to connect to the master. Once the connection is established, in automatic mode, the master will tell the backup to go to non-active. (On a side note, when they connect, the backup will look at the startup time of the master, and decide what to do with cached history, if using partial history mode).

So, because neither node showed the other in its status, they simply weren’t connecting. If there’s any way to get to the logs, they would be useful to see. Either node’s log would probably be sufficient.
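
The startup sequence described above can be modeled roughly in Python as follows. This is only an illustrative sketch of the described behavior, not the gateway's actual code: the state names, the one-second time step, and the default values for the allowance and the reconnect period are all assumptions chosen for the example.

```python
def simulate_backup(master_up_at, startup_allowance=30, reconnect_period=10, horizon=300):
    """Return the backup's (time_in_seconds, state) transitions.

    master_up_at: second at which the master becomes reachable, or None if never.
    All numeric defaults are invented for illustration, not product defaults.
    """
    state = "starting"  # not active yet, trying to reach the master
    transitions = [(0, state)]
    for t in range(horizon + 1):
        attempt_due = t % reconnect_period == 0
        master_reachable = master_up_at is not None and t >= master_up_at
        if attempt_due and master_reachable:
            # Connection established: in automatic mode the master tells
            # the backup to go non-active.
            transitions.append((t, "non-active"))
            return transitions
        if state == "starting" and t >= startup_allowance:
            # Startup connection allowance expired without a connection.
            state = "active"
            transitions.append((t, state))
    return transitions


print(simulate_backup(master_up_at=45))    # master appears after the allowance
print(simulate_backup(master_up_at=None))  # backup never reaches the master
```

In this model, a master that comes up at second 45 leaves the backup active only until its next connection attempt. A backup that never gets through stays active indefinitely, which matches what was seen at the site.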

Regards,

Colby,

Attached is a screenshot, taken via Remote Desktop at the time of the incident, from the master’s console:


Hi,

What time was the power outage? From that log, it actually looks like perhaps the master’s power didn’t go out, but that just the backup (or the switch between them?) did. One thing it clearly shows is that the backup wasn’t attempting to connect, or at least wasn’t getting through.

The log on the backup actually would be helpful, along with perhaps a screenshot of the redundancy settings. If it started working after you went to independent and then back to backup, then something either hadn’t started or had started up incorrectly. It’s a long shot, but are there multiple network cards on the backup machine? Is the bind mode in the redundancy settings set to “auto”? I’m trying to imagine if perhaps it could start up and bind to the wrong interface, and not be able to see the master until it started again. The logs might be able to help with this, though.
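
If you want to test the wrong-interface theory by hand, a quick check along these lines could be run on the backup machine. It is a minimal sketch that tries a plain TCP connection to the master from one specific local address. The function names are made up, and the master's host name and the redundancy port have to be filled in from your own settings, since none are assumed here.

```python
import socket


def can_reach(local_ip, master_host, master_port, timeout=3.0):
    """Try a TCP connection to the master, forced to originate from local_ip."""
    try:
        # source_address pins the outgoing interface; port 0 lets the OS pick one.
        with socket.create_connection(
            (master_host, master_port),
            timeout=timeout,
            source_address=(local_ip, 0),
        ):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unusable local addresses.
        return False


def probe_interfaces(local_ips, master_host, master_port):
    """Return {local_ip: reachable} for each candidate address on the backup."""
    return {ip: can_reach(ip, master_host, master_port) for ip in local_ips}
```

Calling `probe_interfaces` with the address of each network card would show whether the master is visible from only one of them. That would at least be consistent with the backup having bound to the wrong one at startup.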

Regards,

Screenshots of the Backup console at the same time:

Hi,

Unfortunately, there’s not much to go on there, though certainly the error about the last packet is strange, since it says the last one was received 8.7 days prior. Although, I guess that with the connection pool, maybe that could have just been the last time the backup started or went active?

I think I’d need the actual wrapper.log file from the backup to get a better sense of what it might have been doing. It’s possible that something was deadlocked, although at this point I don’t have much that points to that, and it would be hard to prove. If, by chance, you find it in this state again, it would be good to run the “thread dump” from the gateway control utility before restarting the backup.

Regards,