Gateway Network + Store & Forward Issues

Louis_Whitburn · December 21, 2022, 3:16am

Hello

Our architecture consists of several spoke sites (Site A, B, C, etc.) and a central site (the Hub). All of these sites have redundant Ignition gateways. After upgrading our gateways from version 8.1.14 to 8.1.23, we noticed that one (and only one) of the gateways (the Site A master) has its Gateway Network connection to the Hub master gateway fault and then reconnect roughly every hour. There is no obvious reason for this.
The Ignition logs state that the connection faulted because 31 pings failed. I was able to confirm this by setting the "metro.Transports.Websocket.WebSocketConnection" logger to "Debug" in the log viewer and observing many outgoing messages and no incoming messages leading up to the fault. Despite this, a packet capture on port 8060 shows continuous bidirectional TLS communication leading up to the "ping failure", albeit:

There are a large number of TCP RSTs before the "ping failure" (5.41 is the Hub master, 2.10 is the Site A master), and

[Image: packet capture screenshot]
The rate at which packets are sent falls off quite significantly. The traffic looks like a WebSocket ping being sent, but all that appears to come back is a TCP ACK, without any WebSocket data returned.

During the above, ICMP pings continued to be sent and received.
This would imply that the issue is with Ignition and not the network. Would anyone have any insight into this?

Another possibly related issue is that after these disconnects and reconnects, a few hundred items are quarantined on the Hub (in our History database Store & Forward connection) with the message:

Violation of PRIMARY KEY constraint 'PK__sqlt_dat__BE126DD1575BC9AF'. Cannot insert duplicate key in object 'dbo.sqlt_data_18_2022_12'. The duplicate key value is (24300, 1671591717020).
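
For anyone trying to line up a quarantined row with the gateway logs, the duplicate key in the error can be decoded. This is a minimal sketch, assuming the two values in the key are the tag ID and a Unix epoch timestamp in milliseconds (that matches the usual layout of Ignition's sqlt_data partition tables, but check your own schema). `decode_history_key` is a hypothetical helper name, not an Ignition function.

```python
from datetime import datetime, timedelta, timezone


def decode_history_key(tag_id, t_stamp_ms):
    """Split a (tagid, t_stamp) pair into the tag ID and a UTC datetime.

    Assumes t_stamp_ms is milliseconds since the Unix epoch.
    """
    # Integer arithmetic keeps the millisecond part exact.
    seconds, millis = divmod(t_stamp_ms, 1000)
    when = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return tag_id, when + timedelta(milliseconds=millis)


# Values taken from the error message above.
tag_id, when = decode_history_key(24300, 1671591717020)
print(tag_id, when.isoformat(timespec="milliseconds"))
```

Under that assumption the key decodes to 2022-12-21 03:01:57.020 UTC, which is consistent with the December 2022 partition named in the error.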

I believe the two issues are related, since the quarantined items keep reappearing after the Gateway Network connection faults, but I haven't been able to prove it. Interestingly, the quarantined data includes tags from sites other than Site A. The quarantine issue has persisted ever since we deployed Ignition and is more of an annoyance than anything else, but its correlation with the Gateway Network fault made me wonder whether the two are related.
Our tag history architecture is a tag history splitter on each of the spoke sites, splitting to a local datasource history provider first and then to a remote tag history provider that points to a datasource history provider on the Hub. I would also appreciate any insights into this issue.

Thanks

qpadgham · December 28, 2022, 10:01pm

Hey Louis,

I would suggest opening up a ticket with our support team to take a look into this issue, particularly since it seems to have started with an upgrade and is probably a bit more technical than is likely to be solved solely through the forums. You can find our contact information here: https://support.inductiveautomation.com/hc/en-us.

When you make your ticket, supplying the Wireshark capture you have shown as well as the gateway logs from the various gateways involved would help speed things along. In particular, I would suggest getting thread dumps from both gateways while they're in the disconnected state or close to it, if that's possible to determine.

Regards,

Louis_Whitburn · March 11, 2023, 8:33am

Just updating this with extra information after contacting support, in case anyone else has the same problem. We've yet to figure out why this is happening, but there was a script on the Hub which used system.tag.browse() and then system.tag.readBlocking() to read a specific parameter from each UDT instance in the remote tag providers. For some reason, at one site this was causing Gateway Network faults. I'll update this thread if I uncover anything further.
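
The shape of that kind of script can be sketched as below. This is not the actual script from the Hub: the provider name, tag paths, member name and the two stand-in functions are all made up for illustration. The browse and read functions are passed in so the example runs outside a gateway. On a gateway they would be system.tag.browse and system.tag.readBlocking, whose real return types (a results object and qualified values) are simplified here to plain lists.

```python
def read_member_from_instances(browse, read_blocking, root_path, member):
    """Read one member from every UDT instance directly under root_path.

    browse and read_blocking are injected so this sketch runs anywhere;
    their return shapes are simplified compared to the Ignition functions.
    """
    instances = browse(root_path, {"tagType": "UdtInstance"})
    paths = ["%s/%s" % (item["fullPath"], member) for item in instances]
    if not paths:
        return {}
    # readBlocking takes a list of paths, so all instances go in one call.
    values = read_blocking(paths)
    return dict(zip(paths, values))


# Stand-ins for the Ignition functions, with hypothetical paths and values.
def fake_browse(path, browse_filter):
    return [{"fullPath": path + "/Pump1"}, {"fullPath": path + "/Pump2"}]


def fake_read_blocking(paths):
    return ["value of " + p for p in paths]


result = read_member_from_instances(
    fake_browse, fake_read_blocking, "[SiteA]Pumps", "Status")
print(result)
```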

Louis_Whitburn · April 11, 2024, 1:03am

I stumbled across this thread while looking for information on another GAN issue we'd been having, so I thought I'd update it, as we have uncovered more information since I first posted.
I suspect that the root cause is a bug introduced in versions 8.1.2 and above, where reading a UDT instance (or any folder's '.jsonValues' tag) meant that for each tag in that UDT instance / folder (the same thing really) the gateway would try to enumerate the available alarm pipelines. This was recursive, so it included nested folders etc. We use remote notification profiles on all our sites that connect to a gateway in the cloud with a ~50 ms RTT. In order to enumerate the pipelines for each tag, the gateway would have to query the remote gateway. So if a UDT had 10 tags, that would be 10 * 50 ms RTT, or 500 ms, to read those tags, both from the remote gateway and on the local gateway.
This bug hasn't been fixed yet.
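
The arithmetic above can be sketched as follows, assuming one sequential round trip per tag and ignoring every other cost. The nested dict is a made-up stand-in for a UDT instance (dict values are folders, anything else is a tag); `count_tags` and `estimated_read_ms` are hypothetical helper names.

```python
def count_tags(node):
    """Count leaf tags in a nested dict, recursing into folders."""
    return sum(
        count_tags(child) if isinstance(child, dict) else 1
        for child in node.values()
    )


def estimated_read_ms(node, rtt_ms):
    """Estimate read time when every tag costs one round trip."""
    return count_tags(node) * rtt_ms


# The case from the post: a flat UDT with 10 tags at ~50 ms RTT.
flat_udt = {"tag%d" % i: None for i in range(10)}
print(estimated_read_ms(flat_udt, 50))

# A hypothetical UDT with a nested folder: 3 + 2 = 5 tags.
nested_udt = {"a": None, "b": None, "c": None, "alarms": {"hi": None, "lo": None}}
print(estimated_read_ms(nested_udt, 50))
```

Because the count is recursive, deeply nested UDTs scale the delay with the total number of member tags, not just the top-level ones.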
While the bug above wasn't directly affecting the tag reads I mentioned in my previous post (we were only reading single tags in a UDT), the issue is that for the duration of a UDT instance read the tag provider thread blocks until the read completes, to avoid race conditions. What was happening was that something else was reading UDT instances in that tag provider, which slowed down the tag reads that our script was trying to do.
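
A simplified model of that blocking, assuming a single thread that serves reads strictly in order (this illustrates the effect, not how Ignition schedules reads internally). The 5 ms single-tag read time is a made-up figure, and `completion_times_ms` is a hypothetical helper name.

```python
def completion_times_ms(read_durations_ms):
    """Return when each read finishes if one thread serves them in order."""
    finished = []
    clock = 0
    for duration in read_durations_ms:
        clock += duration  # each read waits for everything ahead of it
        finished.append(clock)
    return finished


# A 500 ms UDT instance read queued ahead of a 5 ms single-tag read.
print(completion_times_ms([500, 5]))
```

In this model the single-tag read finishes at 505 ms even though it only needs 5 ms of work, which is the kind of slowdown described above.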