Delayed Transaction Groups When Switching Back to Master

I am experiencing a delay in our database tables that are being logged by a transaction group when our redundancy backup switches back to the master. There does not seem to be any delay when the master goes down and the backup node takes over. I am assuming this is because the backup is set to warm. If I am correct, is there a way to make the master wait until it is reading values from all the tags before becoming active, or at least to delay the failover back to the master?

My group is set to log every 5 minutes. When I stop the Ignition service on the master node, the backup takes over. I wait for the backup node to write a record to the table, which happens within 5 minutes of the master going down. Then I start the service back up on the master, but a record does not get written to the table for at least 15 minutes. I am losing a transaction or two.

Thank you,

What would happen if I increased the “Startup Connection Allowance” from 30000 to 120000? Would that keep the backup node from switching back to the master node so quickly?

I do not fully understand what this paragraph in the manual is trying to say.

It’s possible. That’s basically how long the master will wait after starting up before deciding whether or not it should be active. The idea is to give the backup a chance to connect first, because in certain modes, if the backup is running, you wouldn’t want the master to just start up right away.

We made a small change to the way this stuff works 2 weeks ago (it will be in 7.2.9, the beta is up now), in order to address what seemed to be a problem where the master was never taking over again. Unfortunately, we weren’t able to find a smoking gun, so I can’t say for certain that it fixes the problem you’re running into.

If possible, try setting up your test with the latest beta. If you don’t want to do that, and this problem is repeatable, perhaps you could turn the redundancy loggers to trace, try the test, and then send me the logs. Since you’re restarting the system, you would have to set the logger level in the config file in order for it to persist. Let me know if you want to go this route and I’ll give you more info.

Regards,

I would like to log my test scenario so we can see what is going on. What do I have to do to set the log to persist?

Thank you,

Hi,

Track down the log4j.properties file in “{InstallDir}/contexts/main”, and add the following line:

log4j.logger.Redundancy.StateMonitoring=DEBUG

You can do this on both systems if you want, but really you only need to do it on the master, since that’s the one you’ll be restarting. On the backup, you can just go to the console section and search for “StateMonitoring”, and set the highest one you can find to “DEBUG”.
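
If you also want the broader redundancy loggers at trace, as mentioned earlier, a parent-logger entry in the same file should pick up everything underneath Redundancy through the normal log4j name hierarchy (I haven’t listed the individual child loggers, so treat the exact names as something to verify):

log4j.logger.Redundancy=TRACE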

Now restart the master gateway and try to catch it in the act. When you do, grab the wrapper.log files from both servers and send them in, or post them here.

I tried to mock this up with 7.2.9, but everything worked as expected. One other thing that might be useful to include: the redundancy.xml files from both machines (located in the same “contexts/main” directory). This will let me see how the settings are configured.

Regards,

Remember, my transaction groups are set to execute every 5 minutes. Here is a list of events:

9:03 AM master node transaction group was executed and row was written to database table
9:05 AM I stopped the Ignition service on the master server
9:05:55 AM backup node transaction group triggered and row was written to database table
9:07 AM I started the Ignition service on the master server
9:18 AM master node transaction group was executed and row was written to database table

Looking at the list of events, you can see the first issue: the backup node executed the transaction group as soon as it became active, around 9:05:55 AM. That was only 2-3 minutes after the master node’s transaction group executed, not 5 minutes.

How come the backup node does not know the last time the transaction group was executed on the master node? Is it possible to make the backup node aware of the last transaction time on the master so it will continue to execute at the same time interval?

When I started the Ignition service on the master node at 9:07 AM, the master became active, but the transaction group did not execute until 11 minutes later. I can see from the logs that the trigger did not fire because it had bad quality for 11 minutes. Any suggestions on how to get the quality to become good quicker so I do not lose a transaction?

Thank you,
Backup-wrapper.log (1.47 MB)
Master-wrapper.log (118 KB)

We’re looking into what could cause the subscription to take so long to return good values. I suspect that is more of the problem than the master not becoming “active”, since, as you noted, the master seemed to become active pretty quickly in the logs.

In regard to timing, details like when tasks executed are not communicated between the nodes, so the group will execute as soon as it can. I really don’t think it’s feasible for us to try to coordinate this, though it might be possible for you to accomplish it with a more complex trigger. Static SQLTag values are coordinated between the gateways, so if you had a SQLTag that you wrote to in order to record the last execution, and then used that in your trigger, in theory it should work more like what you want, since it will be updated on the backup as it’s modified. There’s a rough sketch of the idea below.
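
Here is a minimal Jython sketch of what I mean, just to illustrate the idea rather than a drop-in solution. The tag paths ("[default]LastGroupExec" and "[default]GroupTrigger"), the 5-minute interval, and the choice of running the check from a gateway timer script are all my own placeholders, and I’m assuming the system.tag.getTagValue / system.tag.writeToTag scripting calls here:

from java.lang import System
import system

INTERVAL_MS = 5 * 60 * 1000  # the group's intended 5-minute interval

def checkGroupTrigger():
    nowMs = System.currentTimeMillis()
    # Static SQLTag values are synchronized between the gateways, so whichever
    # node is active sees the last execution time recorded by the other one.
    lastMs = system.tag.getTagValue("[default]LastGroupExec") or 0

    if nowMs - lastMs >= INTERVAL_MS:
        # Fire the group's trigger and record the new execution time, so the
        # backup keeps the same 5-minute cadence after a failover.
        system.tag.writeToTag("[default]GroupTrigger", 1)
        system.tag.writeToTag("[default]LastGroupExec", nowMs)

The important part is only that the last-execution timestamp lives in a static SQLTag shared by both gateways rather than in the group itself; you could just as well reset the trigger and update the timestamp from within the group instead of a script.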

I’ll let you know what we find,

I wanted to note something interesting. If I copy my transaction group and change the timing to 5 seconds, then when I switch back and forth between master and backup only about 10 seconds are lost, not 10-15 minutes. So the problem still seems to point toward the tags not getting values fast enough on the 5-minute transaction groups.

Yes, I’m guessing that the problem will turn out to be some multiple of the subscription rate. It seems like it’s taking 2-3 subscription cycles for values to come in, which at the 5-minute rate would line up with the 10-15 minutes you’re seeing.

Regards,