Issues re-establishing a session with external OPC server after the external server is rebooted or restarted

At our company, we connect Ignition to another OPC UA server that is set up on a separate VM hosted by us. Ignition is the client in this case. The tags associated with the server are pulled into an Ignition project and used to communicate with other devices that are connected to the Ignition OPC UA server.

When our server is restarted, the OPC connection to it should be re-established once the server comes back up. However, the connection doesn’t seem to get restored with the default OPC UA connection settings (or at least Ignition doesn’t connect to a new session). I was able to modify two settings to allow reconnection when our server is restarted:

The Keep-Alive Interval parameter on the Ignition VM for this OPC connection was changed to 500 ms (default = 15000 ms).

The Keep-Alive Timeout parameter on the Ignition VM for this OPC connection was changed to 500 ms (default = 10000 ms).

This, however, led to memory leak issues on our OPC server VM. Also, we didn’t have to do this in v7.9.x. The issue seems to be specific to v8.0.x (we are currently on v8.0.9).
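For a rough sense of how much more keep-alive traffic the modified interval generates, here is a back-of-the-envelope calculation in Python. It assumes exactly one keep-alive request per interval, which is a simplification and not a statement about how Ignition actually schedules requests; the function name is made up for illustration. Whether the extra traffic is related to the memory leak is not established.

```python
# Rough arithmetic only: assumes one keep-alive request is sent per interval.
DEFAULT_INTERVAL_MS = 15000   # default value quoted above
MODIFIED_INTERVAL_MS = 500    # value the setting was changed to


def keep_alives_per_hour(interval_ms: int) -> float:
    """Number of keep-alive requests in one hour at the given interval."""
    return 3_600_000 / interval_ms


default_rate = keep_alives_per_hour(DEFAULT_INTERVAL_MS)
modified_rate = keep_alives_per_hour(MODIFIED_INTERVAL_MS)
ratio = modified_rate / default_rate

print(default_rate, modified_rate, ratio)  # prints: 240.0 7200.0 30.0
```

Under that assumption, the 500 ms interval produces about 30 times as many keep-alive requests as the default.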

My questions are as follows:

  1. I would like to understand what each of these settings actually does (details would be appreciated)
    a. Keep-Alive Interval
    b. Keep-Alive Timeout
    c. Connect Timeout
    d. Acknowledge Timeout
    e. Request Timeout
    f. Session Timeout
    I referred to the https://docs.inductiveautomation.com/display/DOC80/OPC+UA+Client+Connection+Settings page, but I don’t clearly understand the implications of changing each of the above settings and how it would help (or not help).

  2. Were there any changes between Ignition v7.9.x and v8.0.x that could have caused these issues to appear after updating to v8.0.x?

You shouldn’t need to mess with any of those settings.

The easiest way to troubleshoot this would be to configure the connection to use no security, start a Wireshark capture, and then restart the other OPC server. Stop the capture after the server has been up for a couple minutes and then we can take a look at what’s going on.

Additionally, searching for the following loggers in Ignition and setting them to DEBUG may help: ClientManager, ChannelFsm, SessionFsm

Is this recommendation meant to help troubleshoot the reconnection issue while keeping the keep-alive settings at their defaults?

If so, how do I check that the connection is configured with no security before we proceed with the Wireshark capture?

Also, where are the loggers you’ve mentioned, and how do I set them to DEBUG?

On the OPC connections page in the gateway, for the connection to this server, go through the endpoint discovery wizard again and choose an endpoint with no security. If one isn’t offered you’ll have to modify the server’s configuration to offer one.

Status > Logs > click the gear icon > search for each logger and set it to DEBUG.

@Kevin.Herron I’m restarting this discussion, which we never got to the bottom of on our end. We’re resuming testing on why Ignition fails to reconnect reliably to one of our OPC servers (Appliance Proxy) when the server is restarted. I’ve enabled debug logging for the loggers you recommended in your previous comment.

I have two log captures: one from before we restarted the processes that control our Appliance Proxy OPC server, and one from after. We did two full restarts after which Ignition successfully reconnected, but the third restart, of only the process associated with the OPC server, did not result in a successful connection. It all happens around 12:00 pm EST (16:00 UTC). However, the idb files are too large to attach to this comment. What’s the best way to share them with you? If I upload them to Google Drive and share the link, will that work? Is there an email address I can share them with?

For Wireshark, is there a recommended filter or set of capture settings to use once it is installed on the Linux-based VM hosting Ignition?

You can upload and email a link to me (first name @ inductiveautomation dot com), but unless you’ve enabled a specific set of loggers they won’t be of any use tracking this down without a Wireshark capture to go with it.

Zipping the 5 or 6 wrapper.log files from the $IGNITION/logs directory might be smaller and easier to send.

edit: oops, should have re-read the thread closer. If you enabled the loggers mentioned above they may help.

Thanks Kevin. I’ve shared a Google Drive link with you via email. Please let me know if you don’t receive it.

Also, I’m looking for your advice on the Wireshark question. A colleague of mine got Wireshark installed on the Linux VM that hosts Ignition, but we’re not sure about the settings or filters needed for the capture.

From what I can tell looking at the logs, you restarted that server around 0900-ish, the keep alive request failed once, then twice, which caused Ignition’s OPC UA client to go into reconnect mode. This is basically expected if you restarted the remote server.
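The sequence described above (one failed keep-alive, then a second, then reconnect mode) can be illustrated with a toy model. This is only a sketch of the general idea of counting consecutive keep-alive failures, not Ignition’s actual client code; the class name, the `failures_allowed` parameter, and the state names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KeepAliveMonitor:
    """Toy model: track consecutive keep-alive failures for one connection.

    failures_allowed is a hypothetical parameter used only for illustration.
    """
    failures_allowed: int = 1
    consecutive_failures: int = 0
    state: str = "CONNECTED"

    def on_keep_alive(self, succeeded: bool) -> str:
        """Record one keep-alive result and return the resulting state."""
        if self.state == "RECONNECTING":
            # Reconnect handling is outside the scope of this sketch.
            return self.state
        if succeeded:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures > self.failures_allowed:
                self.state = "RECONNECTING"
        return self.state


monitor = KeepAliveMonitor(failures_allowed=1)
# One successful keep-alive, then two failures in a row.
states = [monitor.on_keep_alive(ok) for ok in (True, False, False)]
print(states)  # prints: ['CONNECTED', 'CONNECTED', 'RECONNECTING']
```

In this model a single missed keep-alive is tolerated and the second consecutive miss triggers the reconnect, which matches the order of events seen in the logs.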

It does successfully reconnect, but from that point the server both fails to respond to CreateSubscription requests within 60s and also fails to answer the keep alive requests (just a read of the ServerState variable), which causes Ignition to reconnect again, and it just continues like this for the remaining 3 minutes of the logs.

You’re probably going to need to get the vendor of that server involved and have them help you figure out why it stops responding or can’t respond after being restarted.

Thanks Kevin. I am the ‘colleague’.

Follow up:
We can reconnect to the server from Ignition if we go to the ‘OPC Connections’ page, edit the affected connection, and then save it (without making any changes).

Question: Is there some material difference under the hood between an Ignition reconnect and what I assume is more of an ‘initial’ connection that occurs when we do the edit -> save on the connection page?

I think the only real difference is that a reconnect involves an attempt to transfer subscriptions after the session is created and activated and possibly some subtle timing differences. I’m not sure what else…

As far as the Wireshark capture goes, using a host filter (the keyword host followed by an IP address) should be enough to keep the traffic down.

If you’re capturing from the machine running the Ignition gateway then use the IP address of the remote OPC server. You’ll have to figure out the correct adapter for your setup.
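As a minimal sketch of what such a capture could look like on a Linux machine, the Python helper below assembles a tcpdump command line restricted to a single host. The function name, the IP address (taken from the 192.0.2.0/24 documentation range), the interface name eth0, and the output file name are all placeholders; substitute the values for your own setup.

```python
import ipaddress
import shlex


def build_capture_command(server_ip: str, interface: str, out_file: str) -> str:
    """Return a tcpdump command line limited to traffic to/from one host."""
    ipaddress.ip_address(server_ip)  # raises ValueError on a malformed address
    capture_filter = f"host {server_ip}"
    args = ["tcpdump", "-i", interface, "-w", out_file, capture_filter]
    # Quote each argument so the filter expression stays a single shell word.
    return " ".join(shlex.quote(arg) for arg in args)


# Placeholder values only: replace with the remote OPC server's address,
# the correct network adapter, and a file name of your choice.
command = build_capture_command("192.0.2.10", "eth0", "opc_capture.pcap")
print(command)  # prints: tcpdump -i eth0 -w opc_capture.pcap 'host 192.0.2.10'
```

The resulting .pcap file can then be opened in Wireshark. Running tcpdump typically requires root privileges.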

Hello Kevin,
I managed to grab packet captures from both the Ignition machine and the server. I merged them together in Wireshark, and the result is in the same share that @aiyer sent previously. The file is mergedCaptureOPC_Ign_FM.pcap.

Some notes on sequencing / timestamps (approx):
0s - 27s: Normal operation
27s - 35s: Our OPC UA server restarting
70s: Ignition indicates connection faulted
80s: Edit connection & Save in Ignition
~90s: Connection restored

I looked at the captures myself, and it does appear (to my untrained eye) that one of the differences between Ignition reconnecting and the edit/save is that in the latter a CloseSecureChannel is sent, followed by an OpenSecureChannel. I don’t see this sequence of events in a normal reconnect.

That sounds right, there wouldn’t be an explicit CloseSecureChannel when we’ve already determined the connection is lost. Pretty sure that only happens on connection shutdown (edit/save or delete connection).

For troubleshooting this I actually need the same capture, but one where you don’t intervene with an edit/save. In the previous logs, you can see reconnect attempts from Ignition to the server, but the server responds with a ServiceFault indicating Bad_SessionIdInvalid as soon as it receives the first PublishRequest from Ignition, and I need to see whether the session ID is actually invalid or not.

I still have the raw tcpdump outputs (assuming that is what you needed). They are in the directory now with the names ign_capture.pcap and fm_capture.pcap

Scratch that… I thought you meant my edit/saving of the pcap files, but you mean edit/save of the connection. I will need to redo that and reupload.

I’ll look, but what I need is a capture where you didn’t edit/save the connection in Ignition so soon after restarting the server.

Okay, I think I got it. File is mergedCaptureOPC_Ign_FM_NoEdit-Save.pcap

Sequencing / timestamps (approx):
0s - 30s: Normal operation
30s - 40s: Our OPC server restarting
73s: Ignition indicates connection faulted
150s: End of capture

Sorry, it’s better, but it’s still not showing any of the same stuff that happened in those previous logs. It looks like it reconnects and then does nothing but some browsing and maybe keep-alive reads. There are no PublishRequests or anything like that happening after the reconnect. Maybe I’m looking at the wrong traffic?

edit: another thing - do you have any non-default settings configured on these connections in Ignition? Like changes to the keep alive settings or configuring a “failover” endpoint to the same server or something?

I noticed your server is based on FreeOpcUa, so I did a little testing, and I’m seeing some similar results after stopping and starting the example server.

What I’m seeing is that after receiving some unexpected requests from Ignition after it reconnects (highlighted in the screenshot) the server “crashes”, logs some errors, and then basically stops responding at all until a new connection is forced by edit/save in Ignition.

That Ignition sends these previously in-flight requests immediately upon reconnecting but before the session is established is a known issue, but all that should happen is that the server should close the connection and then Ignition reconnects and proceeds like normal.

Instead what happens is this seems to crash the server and render the secure channel unusable.

I’m attaching a pcap that shows it as well: freeopcua-example-server.pcap (129.2 KB)

If your server isn’t actually based on FreeOpcUa then perhaps this isn’t relevant.

The strange thing is that this behavior is similar to what I see in the recent captures you uploaded, but it’s not at all similar to the Ignition logs you uploaded, where it was clear that the server did the “right thing” upon receiving these requests and closed the connection and then things proceeded to fall apart a little further when it thought the session was invalid.

edit: mixed your logs up with another customer that support is working with on a reconnect issue. That makes things less confusing, as my captures match your captures and your logs.

Kevin,

Thanks for going beyond +ultra on this one. I’m going to get with our internal teams and see what we can do in light of your findings.

-Fe