Gateway Redundancy fake switch

Mau89 · December 11, 2018, 8:51am

Hi everyone,
For the last few weeks I have had the following problem in a redundant application.
Sometimes, around midnight, clients lose communication with the gateway for a few minutes (6-7 minutes). Sometimes an info popup is shown asking to check the master and redundancy connections.

In the master gateway log I found this:

MasterStateManager	11Dec2018 00:00:58
	Redundancy state changed: Role=Master, Activity level=Cold, Project state=Good, History level=Full
MasterTCPChannel	11Dec2018 00:00:58
	Peer node information has been updated: RedundancyNode(address=172.31.10.76, httpAddresses=[http://172.31.10.76:8088/main], sessionCount=0, activityLevel=Active, projectState=Good)
MasterTCPChannel	11Dec2018 00:00:57
	Received a full runtime state update from the other redundant node.
MasterTCPChannel	11Dec2018 00:00:57
	Peer node information has been updated: RedundancyNode(address=172.31.10.76, httpAddresses=[http://172.31.10.76:8088/main], sessionCount=0, activityLevel=Cold, projectState=Good)
MasterTCPChannel	11Dec2018 00:00:57
	Reporting master start time of Fri Dec 07 14:33:43 CET 2018

and, a few log entries later, this:

MasterTCPChannel	11Dec2018 00:00:59
	Peer node information has been updated: RedundancyNode(address=172.31.10.76, httpAddresses=[http://172.31.10.76:8088/main], sessionCount=0, activityLevel=Cold, projectState=Good)
MasterStateManager	11Dec2018 00:00:59
	Redundancy state changed: Role=Master, Activity level=Active, Project state=Good, History level=Full

In the backup gateway log:

Provider	11Dec2018 00:23:49
	Starting scan classes due to redundancy state change.
ModbusDriver2	11Dec2018 00:23:49
	Redundancy ActivityLevel changed from Cold -> Active, scheduling a connect.
ModbusDriver2	11Dec2018 00:23:49
	Redundancy ActivityLevel changed from Cold -> Active, scheduling a connect.
ModbusDriver2	11Dec2018 00:23:49
	Redundancy ActivityLevel changed from Cold -> Active, scheduling a connect.
ModbusDriver2	11Dec2018 00:23:49
	Redundancy ActivityLevel changed from Cold -> Active, scheduling a connect.
BackupStateManager	11Dec2018 00:23:49
	Redundancy state changed: Role=Backup, Activity level=Active, Project state=Good, History level=Full
BackupTCPChannel	11Dec2018 00:23:49
	For troubleshooting: The last received message was [CurrentStateMessage[activity=Active, sessions=16, projectstate=null]] The last sent message was [[RTSYNC_MSG, id=300]]
BackupTCPChannel	11Dec2018 00:00:57
	Received a full runtime state update from the other redundant node.
BackupTCPChannel	11Dec2018 00:00:56
	Peer node information has been updated: RedundancyNode(address=172.31.10.79, httpAddresses=[http://172.31.10.79:8088/main], sessionCount=16, activityLevel=Active, projectState=null)
ProjectRunner	11Dec2018 00:00:56
	Setting SQL Bridge project enabled state to 'DISABLED'
ProjectRunner	11Dec2018 00:00:56
	Setting SQL Bridge project enabled state to 'DISABLED'
ProjectRunner	11Dec2018 00:00:56
	Setting SQL Bridge project enabled state to 'DISABLED'
ProjectRunner	11Dec2018 00:00:56
	Setting SQL Bridge project enabled state to 'DISABLED'
Provider	11Dec2018 00:00:56
	Stopping scan classes due to redundancy state change.
Provider	11Dec2018 00:00:56
	Stopping scan classes due to redundancy state change.
Provider	11Dec2018 00:00:56
	Stopping scan classes due to redundancy state change.
Provider	11Dec2018 00:00:55
	Stopping scan classes due to redundancy state change.
BackupStateManager	11Dec2018 00:00:55
	Redundancy state changed: Role=Backup, Activity level=Cold, Project state=Good, History level=Full
BackupTCPChannel	11Dec2018 00:00:55
	Negotiated activity level has changed. The master node has asked this node to become 'not active'
ProjectRunner	11Dec2018 00:00:55
	Setting SQL Bridge project enabled state to 'ENABLED'
ProjectRunner	11Dec2018 00:00:55
	Setting SQL Bridge project enabled state to 'ENABLED'
ProjectRunner	11Dec2018 00:00:55
	Setting SQL Bridge project enabled state to 'ENABLED'
ProjectRunner	11Dec2018 00:00:55
	Setting SQL Bridge project enabled state to 'ENABLED'
Provider	11Dec2018 00:00:55
	Starting scan classes due to redundancy state change.
Provider	11Dec2018 00:00:55
	Starting scan classes due to redundancy state change.
Provider	11Dec2018 00:00:53
	Starting scan classes due to redundancy state change.
ModbusDriver2	11Dec2018 00:00:53
	Redundancy ActivityLevel changed from Cold -> Active, scheduling a connect.
ModbusDriver2	11Dec2018 00:00:53
	Redundancy ActivityLevel changed from Cold -> Active, scheduling a connect.
BackupTCPChannel	11Dec2018 00:00:53
	For troubleshooting: The last received message was [[VERSION_OK, id=101]] The last sent message was [CurrentStateMessage[activity=Cold, sessions=0, projectstate=Good]]
ModbusDriver2	11Dec2018 00:00:53
	Redundancy ActivityLevel changed from Cold -> Active, scheduling a connect.

and these log entries look strange to me:

Provider	11Dec2018 00:24:02
	Stopping scan classes due to redundancy state change.
Provider	11Dec2018 00:24:02
	Stopping scan classes due to redundancy state change.
BackupTCPChannel	11Dec2018 00:24:02
	Peer node information has been updated: RedundancyNode(address=172.31.10.79, httpAddresses=[http://172.31.10.79:8088/main], sessionCount=16, activityLevel=Active, projectState=null)
BackupTCPChannel	11Dec2018 00:24:02
	Project version synchronized, backup node is up-to-date.
BackupTCPChannel	11Dec2018 00:24:02
	Peer node information has been updated: RedundancyNode(address=172.31.10.79, httpAddresses=[http://172.31.10.79:8088/main], sessionCount=16, activityLevel=Active, projectState=Good)
BackupTCPChannel	11Dec2018 00:24:02
	Peer node information has been updated: RedundancyNode(address=null, httpAddresses=null, sessionCount=16, activityLevel=Active, projectState=null)
BackupTCPChannel	11Dec2018 00:24:02
	Master start time was reported to be 'Tue Dec 11 00:01:02 CET 2018' (adjusted to backup clock)
BackupTCPChannel	11Dec2018 00:24:02
	Server time sync complete. Server time is different by 2968 ms.
Provider	11Dec2018 00:24:02
	Stopping scan classes due to redundancy state change.
Provider	11Dec2018 00:24:02
	Starting scan classes due to redundancy state change.
Provider	11Dec2018 00:24:02
	Starting scan classes due to redundancy state change.

It seems that, for some reason, the system requested a switch to the backup and then immediately recovered on the master.
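
One way to make the sequence easier to follow is to pull out only the "Redundancy state changed" entries and compute how long each activity level lasted. The following is a minimal Python sketch, not something Ignition provides: the names `parse_state_changes` and `durations` are hypothetical, and it assumes the log text is pasted in the same two-line layout as the excerpts above (logger and timestamp on one line, message on the next).

```python
import re
from datetime import datetime

# Header line: "<logger><whitespace>11Dec2018 00:00:58"
HEADER_RE = re.compile(r"^(\S+)\s+(\d{1,2}[A-Za-z]{3}\d{4} \d{2}:\d{2}:\d{2})$")
# Message line of interest, as it appears in the pasted logs.
STATE_RE = re.compile(r"Redundancy state changed: Role=(\w+), Activity level=(\w+)")


def parse_state_changes(text):
    """Return (timestamp, role, activity_level) tuples, oldest first."""
    changes = []
    stamp = None
    for line in text.splitlines():
        header = HEADER_RE.match(line.strip())
        if header:
            # %b expects English month abbreviations (C/English locale).
            stamp = datetime.strptime(header.group(2), "%d%b%Y %H:%M:%S")
            continue
        state = STATE_RE.search(line)
        if state and stamp is not None:
            changes.append((stamp, state.group(1), state.group(2)))
    # The gateway log view lists newest entries first, so sort by time.
    return sorted(changes, key=lambda change: change[0])


def durations(changes):
    """Seconds spent in each activity level before the next change."""
    return [
        (earlier[2], (later[0] - earlier[0]).total_seconds())
        for earlier, later in zip(changes, changes[1:])
    ]


# Two entries copied from the master gateway log above.
SAMPLE = (
    "MasterStateManager\t11Dec2018 00:00:59\n"
    "\tRedundancy state changed: Role=Master, Activity level=Active, "
    "Project state=Good, History level=Full\n"
    "MasterStateManager\t11Dec2018 00:00:58\n"
    "\tRedundancy state changed: Role=Master, Activity level=Cold, "
    "Project state=Good, History level=Full\n"
)
```

With the two master entries in `SAMPLE`, `durations(parse_state_changes(SAMPLE))` gives `[("Cold", 1.0)]`, i.e. the master reported Cold for about one second. Running the same functions over the backup log would show how long the backup stayed Active.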

Any ideas?

Thank you

pturmel · December 11, 2018, 1:37pm

My first suspicion would be a network problem. Second suspicion would be a java crash or GC pause-the-world on the master.

Mau89 · December 11, 2018, 4:00pm

Hi,
I don't think it is a GC pause-the-world, because during the fake switch and immediately afterwards the memory usage did not drop but increased. If it were a Java crash I would expect to find a restart of the master gateway, but it has been running for some weeks.
A network problem also seems to me the most probable cause, but so far I haven't found anything in the LAN switch logs.

Thanks