Redundant Pair Backup Resync Repeating

I have a newly upgraded redundant pair (v7.9.18 => v8.1.21) whose backup server won't synchronize. A snip from the log where it decides to do a full resync looks like this:

INFO | jvm 3 | 2022/11/02 11:40:24 | I [R.S.BackupStateManager ] [15:40:24]: Redundancy state changed: Role=Backup, Activity level=Warm, Project state=Good, History level=Full redundant-provider=system
INFO | jvm 3 | 2022/11/02 11:40:24 | I [c.i.i.g.o.s.IgnitionOpcUaServer$OpcUaRedundancyListener] [15:40:24]: redundancyStateChanged: Role=Backup, Activity level=Warm, Project state=Good, History level=Full
INFO | jvm 3 | 2022/11/02 11:41:05 | I [R.S.SyncManager ] [15:41:04]: [internaldb] Will perform full pull. Reason: major change detected between nodes (sync provider UUIDs did not match). gan-remote-gateway-name=Ignition-801SCADA-Master, redundant-provider=internaldb
INFO | jvm 3 | 2022/11/02 11:41:05 | I [R.S.SyncManager ] [15:41:04]: Sync provider 'internaldb' requires full sync and will result in a restart. gan-remote-gateway-name=Ignition-801SCADA-Master, redundant-provider=internaldb
INFO | jvm 3 | 2022/11/02 11:41:05 | I [R.S.BackupStateManager ] [15:41:04]: Redundancy state changed: Role=Backup, Activity level=Warm, Project state=OutOfDate, History level=Full gan-remote-gateway-name=Ignition-801SCADA-Master, redundant-provider=system
INFO | jvm 3 | 2022/11/02 11:41:05 | I [c.i.i.g.o.s.IgnitionOpcUaServer$OpcUaRedundancyListener] [15:41:04]: redundancyStateChanged: Role=Backup, Activity level=Warm, Project state=OutOfDate, History level=Full gan-remote-gateway-name=Ignition-801SCADA-Master
INFO | jvm 3 | 2022/11/02 11:41:05 | I [R.S.BackupStateManager ] [15:41:04]: Peer node information has been updated: RedundancyNode(address=172.16.70.8, httpAddresses=[http://172.26.70.8:8088], activityLevel=Active, projectState=OutOfDate) gan-remote-gateway-name=Ignition-801SCADA-Master, redundant-provider=system
INFO | jvm 3 | 2022/11/02 11:41:12 | I [R.S.BackupStateManager ] [15:41:12]: System restore file successfully downloaded and processed. System will now restart. gan-remote-gateway-name=Ignition-801SCADA-Master, redundant-provider=system
INFO | jvm 3 | 2022/11/02 11:41:12 | I [IgnitionGateway ] [15:41:12]: Restarting JVM gan-remote-gateway-name=Ignition-801SCADA-Master
INFO | jvm 3 | 2022/11/02 11:41:13 | I [IgnitionGateway ] [15:41:13]: Ignition[state=RUNNING] ContextState = STOPPING

After repeating five times, the wrapper kills the backup server's service.

I have ticket #63436 started, but I was wondering if anyone else has seen this (or similar).

How large is the state-transfer .gwbk in the data dir on the backup? If you open it, does it look well-formed?

213 MB.

Ark opens it fine. An actual gateway backup from the master is ~ 77 MB.

The db_backup_sqlite.idb file inside it opens fine as well, and passes an integrity check in DB Browser for SQLite.
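
A .gwbk is just a ZIP archive, so a quick Python sketch like this can reproduce both of those checks from outside the gateway; the archive path and member name are assumptions, adjust to your install:

import sqlite3
import tempfile
import zipfile

gwbk_path = "/var/lib/ignition/data/state_transfer.gwbk"  # assumed name and location

with zipfile.ZipFile(gwbk_path) as zf:
    bad = zf.testzip()  # returns the first corrupt member, or None
    print("archive looks well-formed" if bad is None else "corrupt member: %s" % bad)
    with tempfile.TemporaryDirectory() as tmp:
        idb = zf.extract("db_backup_sqlite.idb", tmp)  # member name as mentioned above
        conn = sqlite3.connect(idb)
        print("integrity_check:", conn.execute("PRAGMA integrity_check").fetchone()[0])
        conn.close()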

Can you manually restore the state transfer via gwcmd (might have to make the master independent first)?

213 MB is moderately hefty, but I would expect to see some errors in the logs if you were hitting a timeout or something. I do know there have been problems in the past with larger backups syncing properly.

Would it be worth trying with the real backup from the Master? Or are there things in the transfer backup that are needed?

Modules, apparently? But if they're the same between the two gateways, a restore from the master is probably fine.

I can't conveniently make the server independent. It doesn't have internet access for pushing the license back to IA. Or can I move the .ipl files out of the folder?

Probably? I would expect trial or possibly emergency activation mode to suffice to get things in the right place.

I was able to make it independent with the license files temporarily renamed, and it picked the licenses back up after I renamed them back.

And it isn't syncing. Looping again. ):

Hmm. Maybe a jstack thread dump while it's in the loop would be helpful?
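
Something like this rough sketch would grab a few dumps while it loops; it assumes a JDK's jps and jstack are on the PATH and that the gateway JVM is identifiable in the jps output:

import subprocess
import time

def gateway_pid():
    # 'jps -l' lists running JVMs as "<pid> <main class or jar>"
    for line in subprocess.check_output(["jps", "-l"], text=True).splitlines():
        pid, _, name = line.partition(" ")
        if "ignition" in name.lower():  # match string is an assumption
            return pid
    raise RuntimeError("gateway JVM not found in jps output")

pid = gateway_pid()
for i in range(3):  # a few dumps, a few seconds apart
    with open("jstack_%d.txt" % i, "w") as f:
        f.write(subprocess.check_output(["jstack", "-l", pid], text=True))
    time.sleep(5)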

I think this is the key. Care to grep the source for me?

Check the redundancy.systemstateuid in redundancy.xml on both sides?
Or, less likely, the "SystemUID" column in SYSPROPS in the idb?
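
If it helps, here is a quick sketch that reads both values on a node so you can compare them between master and backup; the data-dir and internal-database paths are assumptions for a default Linux install:

import sqlite3
import xml.etree.ElementTree as ET

data_dir = "/usr/local/bin/ignition/data"  # adjust to your install

# redundancy.xml is a java.util.Properties XML file: <entry key="...">value</entry>
root = ET.parse(data_dir + "/redundancy.xml").getroot()
uid = next((e.text for e in root.iter("entry")
            if e.get("key") == "redundancy.systemstateuid"), None)
print("redundancy.systemstateuid:", uid)

conn = sqlite3.connect(data_dir + "/db/config.idb")  # internal database, path assumed
print("SYSPROPS SystemUID:", conn.execute("SELECT SystemUID FROM SYSPROPS").fetchone()[0])
conn.close()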

I think we have a winner. In the master:

<entry key="redundancy.systemstateuid">e47f3e14-c3be-4056-89e0-44e779a56aa9</entry>

In the backup:

<entry key="redundancy.systemstateuid">9b0a31a0-acb0-46c5-aaf7-bc78fcea5c2b</entry>

Testing.....

And no luck. These UUIDs are changing on each loop, and are different on master vs. backup. The value in the SYSPROPS table matches the value in redundancy.xml.

FWIW, this is fixed. Support found this in my wrapper log:

INFO | jvm 2 | 2022/11/02 11:38:22 | E [c.i.i.g.r.RedundantPersistenceInterfaceImpl$RunTask] [15:38:22]: Error applying redundant update.
INFO | jvm 2 | 2022/11/02 11:38:22 | simpleorm.dataset.SRecordInstance$BrokenOptimisticLockException: Broken Optimistic Lock [DeviceSettingsRecord 67 Dirty0]
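
If anyone needs to hunt for the same signature, a rough sketch that pulls the matching lines out of the wrapper logs (log path assumed for a default Linux install):

import glob

patterns = ("Error applying redundant update", "BrokenOptimisticLockException")
for path in sorted(glob.glob("/usr/local/bin/ignition/logs/wrapper.log*")):
    with open(path, errors="replace") as f:
        for line in f:
            if any(p in line for p in patterns):
                print(path + ": " + line.rstrip())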

I disabled that device. Something re-enabled it instantly. I deleted that device. Backup server stopped resyncing. /:

I have since found a script doing this, wildly, as a side effect of a slight behavior difference in objectScript.

Oy!


Can you tell us what service was running that you deleted? From your screenshot I am not sure. I am having the same issue on 3 sets of redundant servers.
Thanks!

I didn't delete a service. I deleted a device connection under the OPC Server configuration. The root cause was a wild gateway script that was re-enabling a broken device, causing a "flood" of redundancy updates that the backup could never "catch up" with.
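
For anyone with a similar script: below is a rough Jython sketch of a guarded version that only writes the config change when the device is actually disabled, since every such write is a redundancy update the backup has to absorb. The device name and the timer-script placement are placeholders, not what I actually had.

# Gateway timer script (hypothetical). system.* is available implicitly
# in gateway scope, so no imports are needed.
DEVICE_NAME = "MyPLC"  # hypothetical device connection name

def reenableIfDisabled():
    # listDevices() returns a dataset with Name, Enabled, State, ... columns
    devices = system.device.listDevices()
    for row in range(devices.getRowCount()):
        if devices.getValueAt(row, "Name") == DEVICE_NAME:
            if not devices.getValueAt(row, "Enabled"):
                # A gateway config change triggers a redundancy update,
                # so only make it when it is actually needed.
                system.device.setDeviceEnabled(DEVICE_NAME, True)
            return

reenableIfDisabled()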