Redundant Pair Backup Resync Repeating

I have a newly upgraded redundant pair (v7.9.18 => v8.1.21) whose backup server won't synchronize. A snip from the log where it decides to do a full resync looks like this:

INFO | jvm 3 | 2022/11/02 11:40:24 | I [R.S.BackupStateManager ] [15:40:24]: Redundancy state changed: Role=Backup, Activity level=Warm, Project state=Good, History level=Full redundant-provider=system
INFO | jvm 3 | 2022/11/02 11:40:24 | I [c.i.i.g.o.s.IgnitionOpcUaServer$OpcUaRedundancyListener] [15:40:24]: redundancyStateChanged: Role=Backup, Activity level=Warm, Project state=Good, History level=Full
INFO | jvm 3 | 2022/11/02 11:41:05 | I [R.S.SyncManager ] [15:41:04]: [internaldb] Will perform full pull. Reason: major change detected between nodes (sync provider UUIDs did not match). gan-remote-gateway-name=Ignition-801SCADA-Master, redundant-provider=internaldb
INFO | jvm 3 | 2022/11/02 11:41:05 | I [R.S.SyncManager ] [15:41:04]: Sync provider 'internaldb' requires full sync and will result in a restart. gan-remote-gateway-name=Ignition-801SCADA-Master, redundant-provider=internaldb
INFO | jvm 3 | 2022/11/02 11:41:05 | I [R.S.BackupStateManager ] [15:41:04]: Redundancy state changed: Role=Backup, Activity level=Warm, Project state=OutOfDate, History level=Full gan-remote-gateway-name=Ignition-801SCADA-Master, redundant-provider=system
INFO | jvm 3 | 2022/11/02 11:41:05 | I [c.i.i.g.o.s.IgnitionOpcUaServer$OpcUaRedundancyListener] [15:41:04]: redundancyStateChanged: Role=Backup, Activity level=Warm, Project state=OutOfDate, History level=Full gan-remote-gateway-name=Ignition-801SCADA-Master
INFO | jvm 3 | 2022/11/02 11:41:05 | I [R.S.BackupStateManager ] [15:41:04]: Peer node information has been updated: RedundancyNode(address=172.16.70.8, httpAddresses=[http://172.26.70.8:8088], activityLevel=Active, projectState=OutOfDate) gan-remote-gateway-name=Ignition-801SCADA-Master, redundant-provider=system
INFO | jvm 3 | 2022/11/02 11:41:12 | I [R.S.BackupStateManager ] [15:41:12]: System restore file successfully downloaded and processed. System will now restart. gan-remote-gateway-name=Ignition-801SCADA-Master, redundant-provider=system
INFO | jvm 3 | 2022/11/02 11:41:12 | I [IgnitionGateway ] [15:41:12]: Restarting JVM gan-remote-gateway-name=Ignition-801SCADA-Master
INFO | jvm 3 | 2022/11/02 11:41:13 | I [IgnitionGateway ] [15:41:13]: Ignition[state=RUNNING] ContextState = STOPPING

After repeating five times, the wrapper kills the backup server's service.

I have ticket #63436 started, but I was wondering if anyone else has seen this (or similar).

How large is the state-transfer .gwbk in the data dir on the backup? If you open it, does it look well-formed?

213 MB.

Ark opens it fine. An actual gateway backup from the master is ~ 77 MB.

The db_backup_sqlite.idb file inside it opens fine as well, and passes an integrity check in DB Browser for SQLite.
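
A .gwbk is just a ZIP archive, so a quick Python sketch like this can reproduce both of those checks from outside the gateway; the archive path and member name are assumptions, adjust to your install:

import sqlite3
import tempfile
import zipfile

gwbk_path = "/var/lib/ignition/data/state_transfer.gwbk"  # assumed name and location

with zipfile.ZipFile(gwbk_path) as zf:
    bad = zf.testzip()  # returns the first corrupt member, or None
    print("archive looks well-formed" if bad is None else "corrupt member: %s" % bad)
    with tempfile.TemporaryDirectory() as tmp:
        idb = zf.extract("db_backup_sqlite.idb", tmp)  # member name as mentioned above
        conn = sqlite3.connect(idb)
        print("integrity_check:", conn.execute("PRAGMA integrity_check").fetchone()[0])
        conn.close()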

Can you manually restore the state transfer via gwcmd (might have to make the master independent first)?

213 MB is moderately hefty, but I would expect to see some errors in the logs if you were hitting a timeout or something. I do know there have been problems in the past with larger backups syncing properly.

Would it be worth trying with the real backup from the Master? Or are there things in the transfer backup that are needed?

Modules, apparently? But if they're the same between the two gateways, a restore from the master is probably fine.

I can't conveniently make the server independent. It doesn't have internet access for pushing the license back to IA. Or can I move the .ipl files out of the folder?

Probably? I would expect trial or possibly emergency activation mode to suffice to get things in the right place.

I was able to make it independent with the license files temporarily renamed, and it picked the licenses back up after I renamed them back.

And it isn't syncing. Looping again. ):

Hmm. Maybe a jstack thread dump while it's in the loop would be helpful?
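
Something like this rough sketch would grab a few dumps while it loops; it assumes a JDK's jps and jstack are on the PATH and that the gateway JVM is identifiable in the jps output:

import subprocess
import time

def gateway_pid():
    # 'jps -l' lists running JVMs as "<pid> <main class or jar>"
    for line in subprocess.check_output(["jps", "-l"], text=True).splitlines():
        pid, _, name = line.partition(" ")
        if "ignition" in name.lower():  # match string is an assumption
            return pid
    raise RuntimeError("gateway JVM not found in jps output")

pid = gateway_pid()
for i in range(3):  # a few dumps, a few seconds apart
    with open("jstack_%d.txt" % i, "w") as f:
        f.write(subprocess.check_output(["jstack", "-l", pid], text=True))
    time.sleep(5)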

I think this is the key. Care to grep the source for me?

Check the redundancy.systemstateuid in redundancy.xml on both sides?
Or, less likely, the "SystemUID" column in SYSPROPS in the idb?
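
If it helps, here is a quick sketch that reads both values on a node so you can compare them between master and backup; the data-dir and internal-database paths are assumptions for a default Linux install:

import sqlite3
import xml.etree.ElementTree as ET

data_dir = "/usr/local/bin/ignition/data"  # adjust to your install

# redundancy.xml is a java.util.Properties XML file: <entry key="...">value</entry>
root = ET.parse(data_dir + "/redundancy.xml").getroot()
uid = next((e.text for e in root.iter("entry")
            if e.get("key") == "redundancy.systemstateuid"), None)
print("redundancy.systemstateuid:", uid)

conn = sqlite3.connect(data_dir + "/db/config.idb")  # internal database, path assumed
print("SYSPROPS SystemUID:", conn.execute("SELECT SystemUID FROM SYSPROPS").fetchone()[0])
conn.close()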

I think we have a winner. In the master:

<entry key="redundancy.systemstateuid">e47f3e14-c3be-4056-89e0-44e779a56aa9</entry>

In the backup:

<entry key="redundancy.systemstateuid">9b0a31a0-acb0-46c5-aaf7-bc78fcea5c2b</entry>

Testing.....

And no luck. These UUIDs are changing on each loop, and are different on master vs. backup. The value in the SYSPROPS table matches the value in redundancy.xml.

FWIW, this is fixed. Support found this in my wrapper log:

INFO | jvm 2 | 2022/11/02 11:38:22 | E [c.i.i.g.r.RedundantPersistenceInterfaceImpl$RunTask] [15:38:22]: Error applying redundant update.
INFO | jvm 2 | 2022/11/02 11:38:22 | simpleorm.dataset.SRecordInstance$BrokenOptimisticLockException: Broken Optimistic Lock [DeviceSettingsRecord 67 Dirty0]
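
If anyone needs to hunt for the same signature, a rough sketch that pulls the matching lines out of the wrapper logs (log path assumed for a default Linux install):

import glob

patterns = ("Error applying redundant update", "BrokenOptimisticLockException")
for path in sorted(glob.glob("/usr/local/bin/ignition/logs/wrapper.log*")):
    with open(path, errors="replace") as f:
        for line in f:
            if any(p in line for p in patterns):
                print(path + ": " + line.rstrip())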

I disabled that device. Something re-enabled it instantly. I deleted that device. Backup server stopped resyncing. /:

I have since found a script doing this, wildly, as a side effect of a slight behavior difference in objectScript.

Oy!


Can you tell us what service was running that you deleted? From your screenshot I am not sure. I am having the same issue on 3 sets of redundant servers.
Thanks!

I didn't delete a service. I deleted a device connection under the OPC Server configuration. The root cause was a wild gateway script that was re-enabling a broken device, causing a "flood" of redundancy updates that the backup could never "catch up" with.
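
For anyone with a similar script: below is a rough Jython sketch of a guarded version that only writes the config change when the device is actually disabled, since every such write is a redundancy update the backup has to absorb. The device name and the timer-script placement are placeholders, not what I actually had.

# Gateway timer script (hypothetical). system.* is available implicitly
# in gateway scope, so no imports are needed.
DEVICE_NAME = "MyPLC"  # hypothetical device connection name

def reenableIfDisabled():
    # listDevices() returns a dataset with Name, Enabled, State, ... columns
    devices = system.device.listDevices()
    for row in range(devices.getRowCount()):
        if devices.getValueAt(row, "Name") == DEVICE_NAME:
            if not devices.getValueAt(row, "Enabled"):
                # A gateway config change triggers a redundancy update,
                # so only make it when it is actually needed.
                system.device.setDeviceEnabled(DEVICE_NAME, True)
            return

reenableIfDisabled()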