Ignition Websockets for Scada Dashboards

I'm noticing a trend here: at 12:45 PM CST every day, we lose Andon connections to our Scada Dashboards across all of our Scada Endpoints. This originally happened at 1:01 PM CST last year; around the end of the year (Nov/Dec) it shifted to 12:45 PM CST.

The issue happens across a variety of devices/browsers.

The Scada Endpoints themselves do not time out: if you bring up any of the Ignition home/status/config pages, they respond normally.

  • Scada Endpoints remain online and visibly Healthy
  • Container probes do not fail or cause the pods to go offline/restart.
  • The Scada Dashboards within those endpoints eventually return HTTP 404.
  • The HTTP 200 responses we receive per heartbeat drop to 0 for the Scada Dashboards over the course of about one minute, and the sessions remain disconnected until we refresh the Scada Dashboard on the clients (see the probe sketch after this list).
  • Andons/clients see a visible "No Connection to Gateway" on the Dashboard pages ONLY.
  • The browser console (Chrome) shows: Failed to load resource: the server responded with a status of 404 (Not Found).
  • At ~12:46 the endpoints all show connection issues to the Tag endpoints.
  • CPU/Memory/Disk all look fine.
  • No other workloads appear impacted
  • HMI connections to other Ignition endpoints with Perspective work great; no reported issues.
  • Ignition database resources are fine; no issues or spikes.
  • Container resources are all fine on Tag/HMI/Scada endpoints.
  • Network traffic is minimal and otherwise unaffected; only the websockets stop passing traffic.
  • PCAPs show the Scada containers killing the websocket.
  • The Nginx ingress controllers do not show reloads at the time of impact (and rarely reload at all).
  • Only occurs Monday through Friday, never on weekends.
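
For context, the heartbeat check is essentially a periodic HTTP poll of the dashboard URL. A minimal sketch of that kind of probe is below; the URL and poll interval are placeholders, not our actual monitoring:

```python
# Minimal heartbeat probe: poll a Perspective dashboard URL and log the HTTP
# status so the 200 -> 404 transition around 12:45 PM can be timestamped.
# The URL and poll interval are placeholders, not production values.
import datetime
import time

import requests

DASHBOARD_URL = "https://ignition-shop.someurl.com/data/perspective/client/shop01-scada/andon"
POLL_SECONDS = 10

while True:
    try:
        status = requests.get(DASHBOARD_URL, timeout=5).status_code
    except requests.RequestException as exc:
        status = f"error: {exc}"
    print(f"{datetime.datetime.now().isoformat()} {status}")
    time.sleep(POLL_SECONDS)
```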

Ignition Version: 8.1.35
We are trialing 8.1.44 now to address other log issues, but it will be a while before we can roll it out everywhere.

https://ignition-shop.someurl.com/data/perspective/client/shop01-scada/andon/50de1afadfeb-8392-17b791422ec0

I've attached a few screenshots of the logs from 3 of the gateways.

Check with your IT department to see if they have any scheduled tasks at that time during those days.

That exact timing is suspicious, and if it were a cyclical issue (a max timeout on a websocket or similar) I would expect you to see it on the weekends as well.

There are no scheduled tasks during this time. I've spent months hunting and checking: no backups, no upgrades/updates, nothing happens during those windows.

Are these endpoints on a different network, or connected to a different switch, than the dashboard endpoints (or any of the ones having issues)?

I'd work my way up the connection chain, starting at whatever endpoints are having issues and checking each piece of connection equipment (switches/routers) back to the gateway, looking for anything scheduled on those devices or anything in the logs. It could be something as simple as a switch restarting to apply a config update.

Edit: I just noticed at the top of your screenshot that a connection to another tag server (or some other GAN connection) turned to Faulted.


These all generally run within Kubernetes; think one flat, gigantic network of IPs. We run VMs for our Kubernetes nodes, with hundreds of other workloads spread across them. Logically these are split into namespaces that can talk to each other via ingress or via service. We see the issues even with internal Kubernetes connectivity.

In addition, the PCAPs indicate that the Scada pods are killing the TCP/HTTP connections. The Nginx logs also confirm that the "upstream server" (i.e., the containers in the pod) is closing the connections.
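
For what it's worth, this is roughly how we confirm which side closes first in those captures. A sketch along these lines, assuming scapy is available; the capture filename and pod IP are placeholders:

```python
# Rough sketch: walk a capture and report which IP sends the first FIN or RST
# on each TCP stream, to confirm whether the pod side really closes the
# websocket first. The capture filename and pod IP are placeholders.
from scapy.all import rdpcap, IP, TCP

POD_IP = "10.42.0.50"                      # placeholder Scada pod IP
packets = rdpcap("scada_websocket.pcap")   # placeholder capture file

first_close = {}  # canonical stream key -> (closing IP, flag)
for pkt in packets:
    if IP in pkt and TCP in pkt:
        flags = int(pkt[TCP].flags)
        if flags & 0x01 or flags & 0x04:   # FIN (0x01) or RST (0x04)
            key = tuple(sorted([(pkt[IP].src, pkt[TCP].sport),
                                (pkt[IP].dst, pkt[TCP].dport)]))
            if key not in first_close:
                first_close[key] = (pkt[IP].src, "RST" if flags & 0x04 else "FIN")

for key, (closer, flag) in first_close.items():
    side = "pod" if closer == POD_IP else "client/ingress"
    print(f"{key}: first {flag} from {closer} ({side} side)")
```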

None of our other workloads experience connectivity issues, at those times or otherwise. We run hundreds if not thousands of workloads across our clusters with no problems.

Sample Connectivity
Client -> DNS -> F5 -> K8s Node -> Virtual Workload (POD) -> (Container)

(Container) -> Virtual Workload (POD) -> K8s Node -> K8s DNS -> K8s Node -> Virtual Workload (POD) -> (Container)

(Container) -> Virtual Workload (POD) -> K8s Node -> K8s DNS -> F5 DNS -> K8s Node -> Virtual Workload (POD) -> (Container)
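
To compare those hops, a quick reachability check like the sketch below can be run from a client, from a K8s node, and from inside a pod to see where the external (F5) and internal (K8s DNS) paths diverge. The in-cluster service name and ports are placeholders:

```python
# Resolve a name and attempt a TCP connect so the same script can be run from
# different points in the chain. Hostnames and ports below are placeholders.
import socket

TARGETS = [
    ("ignition-shop.someurl.com", 443),               # external path via F5
    ("scada-gateway.scada.svc.cluster.local", 8043),  # placeholder in-cluster service name/port
]

for host, port in TARGETS:
    try:
        addrs = sorted({info[4][0] for info in
                        socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)})
    except OSError as exc:
        print(f"{host}:{port} -> DNS failed: {exc}")
        continue
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"{host}:{port} -> {addrs} connect OK")
    except OSError as exc:
        print(f"{host}:{port} -> {addrs} connect failed: {exc}")
```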

The switches between nodes are managed, highly overkill for this environment in terms of bandwidth/throughput, and show no issues.

We've tested moving workloads across different ports, and even running them on the same nodes. The virtualized nodes have been moved between clusters to eliminate any contention.

The hypervisor almost never shifts workloads across nodes, and there is more than enough capacity that the physical hosts are not exhausted.

Perhaps you need to test outside Kubernetes. Your description of the PCAP evidence is suspicious. (Perhaps you can share an example of that kill?)

Java delegates low-level TCP to your OS, so if the "kill" is an injected FIN, it is almost certainly the OS's fault (whatever container tech Kubernetes uses).

Can you share details on your Service and Ingress configurations? Specifically what type of backing service and what ingress annotations you're using to drive your ingress controller?

Further, have you been able to open dev tools on one of these browser sessions during this scenario and get details on the hello request that should occur periodically during connectivity loss?
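
If it's easier to gather programmatically, a rough sketch like this would pull the pieces I'm after; it assumes the kubernetes Python client, and the namespace and resource names are placeholders:

```python
# Sketch for pulling the Service type and Ingress annotations that matter for
# websocket proxying through nginx (timeouts, affinity, etc.).
# Namespace and resource names below are placeholders.
from kubernetes import client, config

NAMESPACE = "scada"            # placeholder
SERVICE_NAME = "scada-gateway" # placeholder
INGRESS_NAME = "scada-gateway" # placeholder

config.load_kube_config()      # or config.load_incluster_config() inside a pod

svc = client.CoreV1Api().read_namespaced_service(SERVICE_NAME, NAMESPACE)
print("Service type:", svc.spec.type)
print("Session affinity:", svc.spec.session_affinity)

ing = client.NetworkingV1Api().read_namespaced_ingress(INGRESS_NAME, NAMESPACE)
for key, value in (ing.metadata.annotations or {}).items():
    print(f"{key}: {value}")
```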

Apparently there is a gateway network issue between Kubernetes nodes. I've heard some talk about Helm being a workaround for this, but I don't know the specifics.