Hello
Our architecture consists of several spoke sites (sites A, B, C, etc.) and a central site (the Hub). All of these sites have redundant Ignition gateways. After upgrading our gateways from version 8.1.14 to 8.1.23, we have noticed that one (and only one) of the gateways (the Site A master) has its Gateway Network connection to the Hub master gateway fault and then reconnect roughly every hour. There is no obvious reason for this.
The Ignition logs state that the connection has faulted due to 31 pings failing, which I have been able to confirm after setting "metro.Transports.Websocket.WebSocketConnection" to "Debug" in the log viewer and observing that there are many outgoing messages and no incoming messages leading up to the fault. Despite this, a packet capture on port 8060 shows continuous bidirectional TLS communication leading up to the "ping failure", albeit:
- There are a large number of TCP RSTs before the "ping failure" (5.41 is the Hub master, 2.10 is the Site A master), and
- The rate at which packets are sent falls off quite significantly. The traffic looks like a WebSocket ping being sent, but all that appears to come back is a TCP ACK, without any WebSocket data returned.
During the above, ICMP pings continued to be sent and received.
This would suggest that the issue is with Ignition rather than the network. Would anyone have any insight into this?
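To firm up the "roughly every hour" observation, the gaps between fault times can be computed from the gateway logs. Below is a minimal Python sketch, assuming the fault timestamps have been copied out of the log viewer by hand. The function names and the example timestamps are placeholders for illustration, not real log entries.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def fault_intervals(fault_times):
    """Return the gaps between consecutive fault times, oldest first."""
    ordered = sorted(fault_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def summarize(intervals):
    """Return (mean, population standard deviation) of the gaps in seconds."""
    secs = [gap.total_seconds() for gap in intervals]
    return mean(secs), pstdev(secs)


# Placeholder fault times (hypothetical, replace with values from the logs).
example_faults = [
    datetime(2022, 12, 21, 1, 0, 5),
    datetime(2022, 12, 21, 2, 0, 9),
    datetime(2022, 12, 21, 3, 0, 2),
]

gaps = fault_intervals(example_faults)
avg, spread = summarize(gaps)
print(f"mean gap: {avg:.1f} s, spread: {spread:.1f} s")
```

If the real gaps cluster tightly around 3600 seconds, that might point towards something timer-driven rather than random packet loss, although it would not prove it on its own.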
Another possibly related issue is that after these disconnects and reconnects, a few hundred items are quarantined on the Hub (in our History database Store & Forward connection) with the message:
Violation of PRIMARY KEY constraint 'PK__sqlt_dat__BE126DD1575BC9AF'. Cannot insert duplicate key in object 'dbo.sqlt_data_18_2022_12'. The duplicate key value is (24300, 1671591717020).
I suspect that the two issues are related, since the quarantined items keep reappearing after the Gateway Network connection faults, but I haven't been able to prove it. Interestingly, the quarantined data includes tags from sites other than Site A. The quarantine issue has persisted ever since we deployed Ignition and is more of an annoyance than anything else; it was the apparent correlation with the Gateway Network fault that made me wonder whether the two are related.
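One way to test the correlation is to decode the timestamp in the duplicate key and compare it with the fault times in the gateway logs. The Python sketch below assumes that the second value in the key is the t_stamp column, stored as milliseconds since the Unix epoch in UTC. The function names, the 5-minute window and the example fault time are hypothetical and only there for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def t_stamp_to_utc(t_stamp_ms):
    """Convert a millisecond epoch value (assumed UTC) to an aware datetime."""
    return EPOCH + timedelta(milliseconds=t_stamp_ms)


def near_any_fault(t_stamp_ms, fault_times_utc, window=timedelta(minutes=5)):
    """True if the sample time lies within `window` of any logged fault time."""
    sample = t_stamp_to_utc(t_stamp_ms)
    return any(abs(sample - fault) <= window for fault in fault_times_utc)


# Timestamp taken from the duplicate key value in the error message above.
dup_time = t_stamp_to_utc(1671591717020)
print(dup_time.isoformat())

# Hypothetical fault time; replace with real values from the gateway logs.
example_faults = [datetime(2022, 12, 21, 3, 0, 0, tzinfo=timezone.utc)]
print(near_any_fault(1671591717020, example_faults))
```

Under that assumption, the key in the error decodes to 2022-12-21 03:01:57.020 UTC, which is at least consistent with the December 2022 partition name. Running the same check over all quarantined items would show whether they cluster around the faults.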
Our tag history architecture is a Tag History Splitter on each of the spoke sites, which writes first to a local datasource history provider and then to a remote tag history provider that points to a datasource history provider on the Hub. I would also appreciate any insights into this issue.
Thanks