We are testing how Perspective clients react to losing communication with the gateway due to a cable break or the gateway losing power. We observe that it takes 2.5 minutes before the client displays the "No Connection to Gateway" banner.
Looking at the browser console, we see that after approximately 2 minutes and 15 seconds we get the message: store.Channel: Disconnecting websocket with message: "Idle time reached."
After another 15 seconds we get: store.Channel: Websocket connection closed. code=1006, wasClean=false, reason=No reason given, codeMeaning=Normal Closure, codeDescription=Reserved. Indicates that a connection was closed abnormally (that is, with no close frame being sent) when a status code is expected.
Is this the expected behavior, or should the client detect a lost connection faster? Are there any settings we've missed that would decrease the detection time in this fault scenario?
You are dependent on internet standards here, for the most part. The standard timeout for detecting connection loss in TCP/IP is 90 seconds, and HTTP and WebSockets add further layers of delay.
If you had a session heartbeat value, you could make a custom component (via the SDK) with appropriate JavaScript on the browser side to report loss of heartbeat more quickly. I'm not aware of anything in Perspective at this time that provides this functionality. (Redundancy support reacts faster than two minutes, so there's probably something private in IA's JavaScript doing something similar.)
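As a rough sketch of the browser-side piece, assuming a hypothetical heartbeat property that the gateway updates every few seconds (the class name, timeout, and callback are all illustrative, not an existing Perspective API):

```typescript
// Watchdog for a gateway-driven heartbeat value (hypothetical component code).
// onHeartbeat() would be called from the component's property-change handler
// each time the gateway pushes a new heartbeat value; if no update arrives
// within timeoutMs, the connection is presumed dead.
class HeartbeatWatchdog {
  private timer?: ReturnType<typeof setTimeout>;

  constructor(
    private readonly timeoutMs: number,
    private readonly onLost: () => void,
  ) {}

  onHeartbeat(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
    }
    this.timer = setTimeout(this.onLost, this.timeoutMs);
  }

  stop(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
    }
  }
}

// Example: report a lost connection if no heartbeat arrives for 10 seconds.
const watchdog = new HeartbeatWatchdog(10_000, () => {
  console.warn("Gateway heartbeat lost - connection presumed down");
});
```

This way, detection time is bounded by the heartbeat interval plus the watchdog timeout rather than by TCP's timers.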
We're currently testing with a redundant setup. The switchover to the backup gateway is triggered 10 seconds after the connection loss is detected; I assume this is controlled by the "Failover Timeout" setting in the gateway's Redundancy Settings. So the issue is that it takes 2.5 minutes before the client discovers the connection loss when there is a cable break or power loss.
If we stop the Ignition service, or shut down the gateway server in a controlled manner, the websocket connection closes cleanly and immediately instead of waiting for a timeout, and the client switches to the backup gateway after only 10 seconds: store.Channel: Websocket connection closed. code=1001, wasClean=false, reason=Container being shut down, codeMeaning=Going Away, codeDescription=The endpoint is going away, either because of a server failure or because the browser is navigating away from the page that opened the connection.
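To illustrate the difference: a clean shutdown sends a close frame (code 1001), while a cable break or power loss sends nothing, so the browser only learns of the loss when its idle timeout expires (code 1006). A rough sketch of how the two cases look on the browser side (the URL is a placeholder; Perspective's actual channel handling is internal to IA's code):

```typescript
// Placeholder URL - Perspective manages its own websocket internally.
const ws = new WebSocket("wss://gateway.example.com/system/ws");

ws.onclose = (event: CloseEvent) => {
  if (event.code === 1001) {
    // "Going Away": the gateway shut down cleanly and sent a close frame,
    // so the loss is detected immediately.
    console.info("Gateway shut down cleanly:", event.reason);
  } else if (event.code === 1006) {
    // Abnormal closure: no close frame was ever received (cable break,
    // power loss), so this only fires once the idle timeout expires.
    console.warn("Connection lost abnormally, wasClean =", event.wasClean);
  }
};
```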
A crucial part of a redundant system is the ability to detect and react to an unexpected failure, like a broken cable or a sudden power loss, in order to retain normal operation. I definitely think there should be some built-in heartbeat functionality to detect an unintentional/unplanned loss of a gateway within a few seconds rather than a few minutes. So this is a feature request from me (if it's not already planned).
We are having a similar issue to the one in this thread, where our unattended clients on TVs stop updating. However, unlike this issue, we don't get the "No Connection to Gateway" banner, and the gateways don't seem to have any downtime when the client disconnect happens.
The issue seems to be the websocket max idle time (120,000 ms) timeout.
I'm not sure where that gets set, or why the websocket considers itself idle, because there is a lot of data flowing all the time.
I'd appreciate it if anyone who has had a similar issue and solved it could chime in.
The platform is Kubernetes, using the official 8.1.35 image, with the NGINX ingress controller. I have proxy-read-timeout and proxy-write-timeout set to 7200 seconds, after reading this document, to rule out the NGINX configuration as the cause.
Would there be any other setting in the network, gateway, project, etc., or can anything be done in the project to mimic a ping-pong mechanism? Ideally, we would want the client to reconnect or recreate the websocket after a timeout, or not time out at all.
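For illustration, the kind of ping-pong and reconnect behavior we're after would look something like the sketch below in plain browser code (the URL and intervals are made up, and Perspective's channel isn't directly scriptable like this):

```typescript
// Illustrative only: a websocket wrapper that sends an application-level
// ping and reconnects if no traffic has been seen within idleMs.
function connect(url: string, pingMs = 30_000, idleMs = 60_000): void {
  const ws = new WebSocket(url);
  let lastSeen = Date.now();

  ws.onmessage = () => { lastSeen = Date.now(); };

  const pinger = setInterval(() => {
    if (ws.readyState === WebSocket.OPEN) {
      ws.send("ping"); // assumes the server echoes something back
    }
    if (Date.now() - lastSeen > idleMs) {
      ws.close(); // give up; onclose below schedules the reconnect
    }
  }, pingMs);

  ws.onclose = () => {
    clearInterval(pinger);
    setTimeout(() => connect(url, pingMs, idleMs), 5_000); // retry after 5 s
  };
}

connect("wss://gateway.example.com/system/ws"); // placeholder URL
```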
@ataya, you should be seeing keepalive messages across the websocket channel every 30 seconds, which should be enough to keep things working. I spun up a simple VM with K3s (with the default Traefik ingress disabled) and the NGINX Ingress Controller (v3.5.0) installed and set as the default ingress class. With a very basic Perspective view (a single value that updates only every 5 minutes), things worked just as expected. I only needed to add this annotation to my Ingress to get websockets working properly:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ign-backend-primary
  annotations:
    # This should point to the name of the target service
    nginx.org/websocket-services: "ign-backend-primary"
...
```
Can you provide more details on your K8s environment?
@ataya, I may have reproduced something similar to what you've described above. In that same setup, I observed that after the machine I was testing from in a browser went to sleep, it reconnected on waking, but the tag value (my every-5-minutes tag) was still as before. It wasn't until the value updated on the gateway again that a new value finally showed up. I tested a similar setup against a standard installation, and there the browser refreshed upon connectivity restoration, syncing everything back up. There appears to be different behavior under this reverse-proxy configuration, and it will need to be studied in more detail. You might reach out to support and open a ticket.
I'll add the annotation and observe. The timeouts don't happen all the time; the clients have currently been up and running for 2 days since the last timeout.
My assumption is that there are occasional network hiccups, possibly combined with changes to the backend service or gateway restarts, causing the Perspective sessions to be lost on the gateway and preventing the client from initiating a new websocket connection.
If the keepalive ("keepalive messages across the websocket channel every 30 seconds, which should be enough to keep things working") is bi-directional between gateway and client, I think that is good enough to rule out the application level.