Gateway Network Timeout Issues

I'm helping a customer troubleshoot a strange issue that we've seen a few times, and I'm hoping that someone here has seen it or knows what is going on.

There are three separate gateway systems, all redundant. Gateway A and Gateway B both have an outgoing connection to Gateway C.

During the normal course of activities, Gateway A and Gateway B each call system.util.sendRequest against Gateway C and expect a response back. Sometime over the past day, something happened and the connection between Gateway A and Gateway C was disconnected and then reconnected.

The gateway was showing all the connections as running, and the tag in the system folder was showing connected.

However, and this is where the problem is, any sendRequest or sendMessage from Gateway A to Gateway C timed out.

I was able to get things going again by resetting the GAN connection from Gateway A to Gateway C on Gateway A's status page. After that, everything went back to communicating.

Is there anything I can do to diagnose what is going on? This doesn't happen often, but when it does, someone has to reset the connection manually even though nothing appears to be wrong from a status point of view.

Ideas?

Sounds like a bug to me. Open a ticket?

The problem is that it isn't really repeatable, and when it is broken their process is hamstrung until it is back up and running.

I'm hoping to get some troubleshooting logs, etc., that I can grab and send to support when I open the ticket.
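One thing that might help capture evidence is a small probe run periodically on Gateway A that records both what the status tag reports and whether a real request actually gets through. This is only a rough sketch: `probe_once`, the "ping" handler name, and `read_connected` are hypothetical, and the keyword names passed to `send_request` are the ones I understand system.util.sendRequest to accept, so check them against the documentation. The send function is passed in so the logic can be tried outside Ignition.

```python
import time

def probe_once(send_request, read_connected, project, remote_server, timeout_sec=5):
    """Compare the reported connection status with a real round trip.

    send_request   -- in Ignition, pass system.util.sendRequest (assumption)
    read_connected -- hypothetical callable returning the status tag's value
    """
    reported = bool(read_connected())
    started = time.time()
    error = None
    try:
        # "ping" is a hypothetical message handler on Gateway C that returns immediately.
        send_request(project=project, messageHandler="ping", payload={},
                     remoteServer=remote_server, timeoutSec=timeout_sec)
    except Exception as exc:
        # Inside Ignition, Java-side errors may need their own except clause (unverified).
        error = str(exc)
    return {
        "timestamp": started,
        "reportedConnected": reported,
        "requestOk": error is None,
        "elapsedSec": time.time() - started,
        "error": error,
        # The case from this thread: status says connected, but requests fail.
        "mismatch": reported and error is not None,
    }
```

Logging every record where `mismatch` is true, with its timestamp, would at least pin down when the connection went bad relative to the disconnect and reconnect.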

I'll bet there's some bug that kicks in when requests are unable to return their response to a calling gateway.

Try setting up a message handler on "C" that just waits 30 seconds and returns some arbitrary object.

Then fire a request from "A", wait 15 seconds, then break the network until the original request times out. Repeat a few times if necessary.

(Well, probably not on your production gateways.)
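A minimal sketch of that test, for a lab setup only. The handler name "slowEcho" and the helper `fire_slow_request` are made up for illustration, and the extra default arguments on `handleMessage` exist only so it can be exercised outside Ignition; a real gateway message handler just receives `payload`. The keyword names passed to `send_request` are what I understand system.util.sendRequest to take, so verify them before relying on this.

```python
import time

def handleMessage(payload, delay_sec=30, sleep=time.sleep):
    """Body for a hypothetical "slowEcho" message handler on Gateway C."""
    # Deliberately hold the request open so the network can be broken mid-flight.
    sleep(delay_sec)
    return {"echo": payload, "handledAt": time.time()}

def fire_slow_request(send_request, project, remote_server, timeout_sec=60):
    """Send one slow request from Gateway A and report how it ended.

    send_request -- in Ignition, pass system.util.sendRequest (assumption)
    """
    started = time.time()
    try:
        reply = send_request(project=project, messageHandler="slowEcho",
                             payload={"sentAt": started},
                             remoteServer=remote_server, timeoutSec=timeout_sec)
        return ("ok", time.time() - started, reply)
    except Exception as exc:
        # Inside Ignition, Java-side errors may need their own except clause (unverified).
        return ("failed", time.time() - started, str(exc))
```

Call `fire_slow_request` from A, pull the network about 15 seconds in, let the request time out, restore the network, and then check whether an ordinary request to C still gets through.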

Yeah I'll have to set something up in the lab and see if I can reproduce it.

I'm taking note that this is probably the only time @pturmel has ever recommended using something like sleep to delay execution of a script!
:exploding_head:

I'm fine with sleep() in dedicated threads and asynchronous threads. And when deliberately trying to break Ignition. :man_shrugging: