Perspective Websockets

Hello,

I'm trying to get Postman to interact with the Ignition Perspective websockets. When I query the websocket using Postman's websocket feature, I receive an HTTP 500 status code. I need the ability to query the websockets from outside Ignition to help ensure the health is working fine. If you launch Perspective for a shop board and record the network traffic in Chrome, you can see the wss:// connection (example below). I suspect that, since the page creates JavaScript and embeds iframes, there may be some auth pieces or requirements on the websocket. Can I get some more info so we can add more layers of observability to monitor our apps' health?

Error: Unexpected server response: 500

The data below is intentionally altered so it does not mirror our prod/dev systems.

wss://someurl.internal.domain.com/system/pws/shop-scada/2dsfaf1293?token=8Jfrk90Hwv_g1vasdfasdfasdfNzjuRnA
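
For reference, this is the kind of probe I'm attempting outside of Postman - a minimal sketch assuming the third-party websocket-client package, using the sanitized URL above. It only exercises the HTTP upgrade, so a rejection here mirrors what Postman reports:

```python
# Minimal websocket handshake probe (pip install websocket-client).
# The URL is the sanitized example from this post; this only tests the
# HTTP upgrade, not anything about the protocol behind it.
import websocket

WS_URL = ("wss://someurl.internal.domain.com/system/pws/shop-scada/"
          "2dsfaf1293?token=8Jfrk90Hwv_g1vasdfasdfasdfNzjuRnA")

def probe_handshake(url, timeout=10):
    """Attempt the websocket upgrade and classify the result."""
    try:
        ws = websocket.create_connection(url, timeout=timeout)
        ws.close()
        return "handshake accepted"
    except websocket.WebSocketBadStatusException as exc:
        # Server refused the upgrade - e.g. the 500 seen from Postman.
        return "handshake rejected: HTTP %s" % exc.status_code
    except Exception as exc:
        return "connection failed: %r" % exc

if __name__ == "__main__":
    print(probe_handshake(WS_URL))
```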

1 Like

I've looked into this before, and every time I did I would hit a wall. I believe the Perspective message handling system uses websockets, which is why you see the connection in the Network tab of your browser's dev tools. What I was trying to do was send a message to the host Perspective view from an embedded WebDev HTML resource that used a specialized charting library (to have the embedded view and the host view interact to some extent) - no go.

Also, what do you mean by "help ensure the health is working fine"?

In my mind, there is no correlation between monitoring your gateway health and being able to use the under-the-hood websockets protocol. I don't think IA will be very quick to divulge details about how it works - this is a core part of Perspective, and if misused, it will likely break things.

1 Like

As an enterprise, we are running into various instances where websockets time out. For any mature technology or partner, building observability around an application usually requires understanding some of the components involved in the connection. In this case, we just need the ability to measure the health of client connections. You can do this in many form factors, but in our case we need to isolate the websocket connections, as they appear to drop connectivity for "various" reasons.

There absolutely is a correlation with measuring gateway health; we see this being a problem on a daily basis. I've already added client connection monitoring, but that all feeds off the browser itself.

I've seen suggestions on various forum posts to use Java libraries, which is fine, but I'd like to isolate the calls if at all possible.

To troubleshoot any typical technology like REST or websockets, you would break the call down into headers, body, tokens, etc. You can implement a test against auth to pull tokens, possibly followed by the actual call. Many load-testing tools either do this for you or let you integrate with them to test health and responsiveness.
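
To make that concrete, the pattern I'd like to be able to script looks roughly like this. Note the /auth/token endpoint below is purely hypothetical, since I don't know of a documented way to mint the ?token= value the Perspective page puts on the wss:// URL:

```python
# Hypothetical two-step probe: pull a token, then make the actual call.
# NOTE: the /auth/token endpoint is an assumption for illustration only;
# Perspective does not document a way to request the ?token= value.
import requests
import websocket

BASE = "https://someurl.internal.domain.com"

def get_token(user, password):
    # Step 1: test against auth to pull a token (hypothetical endpoint).
    resp = requests.post(BASE + "/auth/token",
                         json={"user": user, "password": password},
                         timeout=10)
    resp.raise_for_status()
    return resp.json()["token"]

def open_socket(token, session_id):
    # Step 2: the actual call, carrying the token like the browser does.
    url = (BASE.replace("https://", "wss://")
           + "/system/pws/shop-scada/%s?token=%s" % (session_id, token))
    return websocket.create_connection(url, timeout=10)

ws = open_socket(get_token("monitor-svc", "********"), "2dsfaf1293")
print("upgrade accepted")
ws.close()
```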

We are looking to capture a few things, but as an SRE I need the ability to implement things like this; otherwise I have to work around it by building my own exporters/monitors or finding other products that work well.

I've never had luck interfacing with a Session's websocket connection via Postman or during automated testing (Selenium).

As for the original purpose - determining session/Gateway health or finding the root cause of websocket time-outs - I don't know that monitoring the websocket would even help.

Think of the websocket as an active phone line. There's not always someone speaking, but the lines are held active, always ready in case someone does speak. Now, monitoring this websocket connection through some external tool is not going to do anything other than tell you when the Gateway or Session "speaks"; even if you can rig something up to tell you when the connection is broken, well, the Gateway is already telling you that.

Some common reasons the websocket connection could time out or close:

  1. The last browser tab of the session is closed.
  2. The network connection of that session is interrupted. Could be network issues, or unintentional I.T. interference, or OS update... Hundreds of things can cause this.
  3. The Gateway or Session tried to pass along a piece of data which was too big. We see this sometimes when rendering non-virtualized Table data or very large datasets. There's no discrete "row" count that causes the issue because column count and data types factor into the package size.
  4. The session reaches the max lifespan as configured in the project's settings.
  5. Session inactivity (again, a project setting).
  6. Native app user minimizes the application.
  7. Project restart (after import of resources through Gateway).
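
If the goal is spotting mass drops, a gateway-side count is more useful than tapping the socket. Here's a sketch, assuming a Gateway Timer script and the documented system.perspective.getSessionInfo() scripting function (the logger name is arbitrary):

```python
# Gateway Timer script sketch (Jython) - log the Perspective session
# count on each run so a mass drop can be correlated with the clock.
logger = system.util.getLogger("SessionWatch")
sessions = system.perspective.getSessionInfo()
logger.info("Perspective sessions: %d" % len(sessions))
```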
2 Likes

Are you determining that they're dropping by looking at the gateway logs, or is Perspective not working as it's supposed to (message handlers not being received, weird communication issues, etc.)?

Yeah, from the VM or K8s side, the pod or VM reports that it is killing the connection. Regardless of whether the ingress (nginx) terminates or just forwards the traffic, we see a pattern where the pods or VMs kill connections at the top of the hour. (Our VMs and K8s run on entirely different platforms in a few cases, and the problem seems systemic: it always happens at the top of an hour during M-F, never on the weekends.)

The network layer is basically golden: no saturation, loops, drops, etc. Wireshark seems to indicate that the pod/VM is killing the connection, as do the nginx logs. (No other parts of our plant/systems exhibit issues.)

The client connections try to send heartbeats, but they are never delivered when the issue happens. It causes ALL client connections to those gateways to drop, and most of the time all gateways go down at the same time. Restarting the sessions brings the Perspective sessions back up as if they never had an issue.

Confirmed no database backups, etcd backups, or really any backups occur during the times we see impacts.
Network activity drops slightly during the impact, but only to Ignition systems.
Hardware is platform-agnostic: the problem occurs on different platforms and on VMs vs. K8s.
The network shows nothing glaring.
Nginx indicates that the upstream server is killing the connection.
Session inactivity is set to a very large number in both the Project and Network settings.
Nginx/F5 have high inactivity timeouts (the app would force inactivity first).
Projects shouldn't be restarting; per the logs and K8s, we see no indication of restarts.
Browser sessions run on Linux screens (think dumb TVs that are more than powerful enough on mem/CPU/disk); node exporters show no indication of problems in any resource, just a slight drop in network traffic when the issue happens.

I have alleviated the issue by implementing client connection monitoring that restarts the Chrome session, which band-aids the issue. When we do this, though, the VMs/pods spike in memory as the sessions re-establish.
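
For anyone curious, the band-aid is essentially the watchdog below - a sketch assuming a Linux kiosk running Chrome. The URLs and relaunch command are placeholders for our environment, and /StatusPing is the gateway health endpoint normally used for load balancers:

```python
# Kiosk watchdog sketch: poll the gateway and relaunch Chrome when the
# check fails. GATEWAY_PING and KIOSK_URL are placeholders.
import subprocess
import time
import urllib.request

GATEWAY_PING = "https://someurl.internal.domain.com/StatusPing"
KIOSK_URL = "https://someurl.internal.domain.com/data/perspective/client/shop-scada"

def gateway_alive(timeout=5):
    try:
        with urllib.request.urlopen(GATEWAY_PING, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def restart_kiosk():
    # Kill the current Chrome session and relaunch in kiosk mode.
    subprocess.call(["pkill", "-f", "chrome"])
    time.sleep(2)
    subprocess.Popen(["google-chrome", "--kiosk", KIOSK_URL])

while True:
    if not gateway_alive():
        restart_kiosk()
        time.sleep(120)  # give the session time to re-establish
    time.sleep(60)
```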

My best guess, after digging into each tech stack, is that Ignition as a whole is triggering some sort of automated process - either the result of a user action or a system process - that makes all Perspective boards puke at the same time. It's always at the top of the hour, but never the same hour: usually M-F, somewhere between 11 and 3 PM. Sometimes it happens outside that pattern, but always at the top of an hour.

I do see random pieces of data being too big in nginx's logs, but that doesn't happen frequently, rarely coincides with the issue, and usually comes from only one client.

These two items make me suspect this is neither the gateways nor the infrastructure, but the client OS and client IT stacks interfering on a schedule.

Have you tried alternate client operating systems?

1 Like

Great question! Yes, I've tried from MacBooks, tablets, TVs (with attached computers), and HMI devices. Generally all Chrome, but the problem happens in any browser tested.

My team directly manages all of the platforms, and I combed through the devices we standardized on. Nothing occurs in a pattern that would cause this impact. We use IaC to manage everything.

1 Like

This strongly suggests several potential culprits:

  1. Scheduled I.T. tasks (like file system scanning).
  2. Scheduled backups.
  3. Scheduled scripts.
2 Likes

Our backups never occur during prod hours. Our IT tasks only run in specific time windows, which I've narrowed 100% to our prod downtime - i.e., they won't trigger during the hours we see these issues. The scripts are all maintained through our tools/repos, so I have direct control over what runs and what doesn't.

1 Like

I'm not sure what to tell you. Something is happening at these times that is causing your issue, but monitoring the websockets themselves is not likely to help you. Your time is probably better spent determining what is happening on the OS, on the network, or within the Gateway at these times. Even if it's just a mass tag update triggering scripting elsewhere, or a device restarting, something is happening on what seems to be a schedule.
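
One cheap way to narrow it down: bucket the nginx upstream-close events by minute-of-hour and confirm the spike really lands at minute zero. A sketch, assuming the default nginx error-log timestamp format; adjust the path and message pattern to whatever your logs actually contain:

```python
# Bucket nginx upstream-close events by minute-of-hour to confirm the
# top-of-the-hour pattern. Assumes timestamps like "2024/01/15 13:00:02".
import collections
import re

LOG_PATH = "/var/log/nginx/error.log"
PATTERN = re.compile(
    r"^\d{4}/\d{2}/\d{2} \d{2}:(\d{2}):\d{2}.*upstream prematurely closed")

buckets = collections.Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = PATTERN.search(line)
        if match:
            buckets[int(match.group(1))] += 1

# A spike at minute 0 supports the top-of-the-hour theory.
for minute, count in sorted(buckets.items()):
    print("minute :%02d -> %d events" % (minute, count))
```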

4 Likes