Modbus TCP comms issues

AlThePal · April 2, 2021, 6:16pm

We have a customer with >100 Modbus TCP devices being read by Ignition, mostly IDEC PLCs. Comms can be fine for days and then suddenly they get loads of errors with a dozen devices indicating they have lost comms. After a few minutes these clear and another handful of devices go into alarm. This can go on for days.

When I looked at the traffic on the network with Wireshark when comms are OK, there is a regular pattern like the following:

[Ignition server] Request for data
[PLC] Acknowledgement of request
[PLC] Response from PLC
[Ignition server] Acknowledgement of response

When comms become unreliable, generally the final acknowledgement from the Ignition server is missing (i.e. it goes straight into the next request without issuing an ACK), although Wireshark says the [PSH, ACK] from the following request is acting as the ACK for the response. The Ignition server also occasionally retransmits the request a few times before the PLC responds.

As with any comms problem the issue could be with any part: the Ignition server, the PLC or the network in between. Given that I was running Wireshark on the Ignition server, it seems like the Ignition server is genuinely not producing the second ACK, so currently it is under the most suspicion.

Can anyone at IA shed any light on this? The problem does seem to have got worse with the increasing number of devices. Each device’s options are set at the default, except that the timeout is 5000 ms, the maximum number of Holding Registers per request is 64 and the maximum number of Coils per request is 128 (a limitation of some of the IDEC PLCs). The server is very lightly loaded and normally sits at 1% CPU and less than 2 GB of memory (it has 128 GB). It has a 1 Gbps network connection.
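
To make the per-request limits above concrete, here is a minimal sketch of how a contiguous block of addresses breaks down into reads under those limits. The function name and the 200-register example are hypothetical, and this is not how the Ignition driver is known to batch requests; it only illustrates the arithmetic.

```python
def split_into_requests(start, count, max_per_request):
    """Split a contiguous address block into (start, quantity) reads."""
    if max_per_request <= 0:
        raise ValueError("max_per_request must be positive")
    requests = []
    offset = 0
    while offset < count:
        # Each read covers at most max_per_request addresses.
        quantity = min(max_per_request, count - offset)
        requests.append((start + offset, quantity))
        offset += quantity
    return requests


# Hypothetical device with 200 holding registers, read with the
# 64-register limit quoted in the post.
print(split_into_requests(0, 200, 64))
# [(0, 64), (64, 64), (128, 64), (192, 8)]
```

A lower limit means more requests per poll cycle, so the number of round trips grows with both the device count and the size of each device's register map.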

Kevin.Herron · April 2, 2021, 6:20pm

I’m not sure this will be of much help, but I can tell you that any retries (assuming you are talking about TCP retransmission in Wireshark) or any TCP-level ACKs that get sent or don’t get sent are not things that Ignition is responsible for - they happen in your OS’s TCP stack.

edit: if you have captures of good and bad traffic I’d be interested in taking a look.

Kevin.Herron · April 2, 2021, 6:31pm

It looks like some of the older drivers (like Modbus) aren’t setting TCP_NODELAY… wonder if that would help
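
For anyone unfamiliar with the option: TCP_NODELAY disables Nagle's algorithm, so small writes are sent immediately instead of being held back while earlier data is unacknowledged. The sketch below shows the socket option using Python's standard library. It is not the Ignition driver's code (which is Java), and the helper name is hypothetical.

```python
import socket


def make_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value turns TCP_NODELAY on for this socket.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_nodelay_socket()
enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print("TCP_NODELAY enabled:", enabled)
sock.close()
```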

AlThePal · April 2, 2021, 6:35pm

Hi Kevin, the attached pcap file is fairly typical of what we’re seeing. It starts out not too bad then degenerates, goes through a few cycles of issues and clears up by the end:

modbus.pcap (34.7 KB)

Kevin.Herron · April 2, 2021, 6:40pm

Not sure. Most likely cause to me seems like it would be packet loss or network congestion. You’re capturing traffic on the Ignition server here, but is it possible for you to view and capture all traffic on the network and see if there are any abnormal occurrences around the time these retransmissions begin?

AlThePal · April 2, 2021, 6:49pm

That’s a possibility, but I was wondering in that case why the Ignition server would stop acknowledging the responses. We should see the ACK locally, even if it was held up on the network.

Kevin.Herron · April 2, 2021, 6:49pm

It hasn’t stopped, they’re just piggybacking on the next request.
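
A rough way to picture piggybacking: a segment acknowledges the response if its ACK number covers the response's last byte, whether or not it also carries data. The sketch below uses a simplified, hypothetical `Segment` model (not a Wireshark or pcap API) and ignores sequence-number wraparound.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    sender: str
    seq: int          # sequence number of the first payload byte
    ack: int          # next byte expected from the other side
    payload_len: int  # bytes of application data carried


def classify_ack(response, next_from_peer):
    """Describe how next_from_peer acknowledges response."""
    expected_ack = response.seq + response.payload_len
    if next_from_peer.ack < expected_ack:
        return "not acknowledged"
    if next_from_peer.payload_len == 0:
        return "pure ACK"
    # The ACK rides along with data, e.g. the next Modbus request.
    return "piggybacked ACK"


# Hypothetical numbers: a 137-byte PLC response, then the server's next
# 12-byte request carrying the acknowledgement.
plc_response = Segment("plc", seq=1000, ack=512, payload_len=137)
next_request = Segment("server", seq=512, ack=1137, payload_len=12)
print(classify_ack(plc_response, next_request))  # piggybacked ACK
```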

AlThePal · April 2, 2021, 6:51pm

I couldn’t find any mention of acknowledgements in the Modbus spec, so I assume this is just TCP-level stuff. In that case the most important issue is the retries that are happening because the PLC is not responding in a timely fashion.
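
As an illustration of why the Modbus spec has nothing to say about ACKs, here is a minimal sketch of a Modbus TCP "Read Holding Registers" request built by hand. The frame has only the MBAP header (transaction id, protocol id, length, unit id) and the PDU; there is no acknowledgement field, so all ACKs seen in Wireshark belong to TCP. The function name and example values are hypothetical.

```python
import struct


def build_read_holding_registers(transaction_id, unit_id, start_address, quantity):
    """Build a Modbus TCP request frame for function code 0x03."""
    function_code = 0x03
    # PDU: function code, starting address, number of registers.
    pdu = struct.pack(">BHH", function_code, start_address, quantity)
    # MBAP header: transaction id, protocol id (0 for Modbus),
    # length of the bytes that follow (unit id + PDU), unit id.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


frame = build_read_holding_registers(1, 1, 0, 64)
print(frame.hex())  # 000100000006010300000040
```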

Kevin.Herron · April 2, 2021, 6:53pm

All of this ACK and retransmission stuff is TCP-level.

In the capture you provided there are no response delays long enough for the driver to decide a request has timed out, but some requests go through 1-2 seconds' worth of TCP retransmissions before they get through and receive a response.
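
To put rough numbers on this, the sketch below models TCP retransmissions with a retransmission timeout that doubles after each attempt and compares the result with the 5000 ms device timeout from the first post. The 300 ms initial timeout is an assumption for illustration only; real stacks derive it from the measured round-trip time and clamp it to OS-specific limits.

```python
def retransmit_times_ms(initial_rto_ms, attempts):
    """Elapsed time (ms) at which each retransmission is sent,
    doubling the retransmission timeout after every attempt."""
    times = []
    elapsed = 0
    rto = initial_rto_ms
    for _ in range(attempts):
        elapsed += rto
        times.append(elapsed)
        rto *= 2
    return times


DRIVER_TIMEOUT_MS = 5000  # device timeout quoted in the original post
INITIAL_RTO_MS = 300      # assumed value, not taken from the capture

times = retransmit_times_ms(INITIAL_RTO_MS, 5)
within_timeout = [t for t in times if t < DRIVER_TIMEOUT_MS]
print(times)           # [300, 900, 2100, 4500, 9300]
print(within_timeout)  # [300, 900, 2100, 4500]
```

Under these assumptions, several retransmissions fit inside the driver timeout, which is consistent with seeing a second or two of retransmissions without the driver reporting a timeout.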

pturmel · April 2, 2021, 7:06pm

Could also be a switch dropping packets in either direction. Due to congestion, perhaps. Or transient failures in the network. Or marginal cabling. Managed switches can usually report their corrupt packet drop counters. Or the switch itself is dying.

Kevin.Herron · April 2, 2021, 7:08pm

I always tell support reps to ask if a forklift ran over any Ethernet cables recently. Guess why.

(I still wonder why in this particular case the cables were even in a position that made this possible…)

dcamp1 · April 2, 2021, 7:17pm

Great point “pturmel” makes!!

AlThePal · April 2, 2021, 7:40pm

This is a substantial network with hundreds of devices spread across many square miles involving lots of radio links.

We’ve arranged a meeting with the customer to carry out a review of the network configuration. We’ll make sure this includes monitoring of links to check on things like dropped or corrupted packets.

Thanks for your help.

pturmel · April 2, 2021, 7:48pm

Eww! Radios and TCP/IP are a notorious combination for dropped packets and stalls.

d.filos · April 7, 2021, 10:07am

For reference, I would like to mention my own experience with 120 Modbus devices connected wirelessly through a private LTE network. As @pturmel points out, things are tricky. Every day each device is “disconnected” from the network multiple times for a few seconds, and by “disconnected” I mean anything that affects packet transmission (e.g. data roaming).

During such events the Modbus driver’s polling service receives no answer and logs a “TIMEOUT” failure once for each transmitted request. Most of the time the connection recovers and the next retry gets through, but if the network is down for longer, the driver reports the device as “DISCONNECTED”. As you can imagine, with 120 devices, approximately 5 Modbus requests per device and 5 to 20 network “disconnections” per device, I get an enormous number of warning messages in the log, which is frustrating and also makes it difficult to identify other types of events.

Until now, I haven’t figured out a solution or a workaround other than increasing the polling period, which only affects the number of logged events. One thing I will investigate is “moving” some devices to a Kepware OPC server and comparing the behaviour, since the plan for this project is to double the number of devices over the year!
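
A back-of-the-envelope estimate of the log volume described here, assuming the 5 to 20 disconnections are per device per day and that each request logs a single TIMEOUT per disconnection (both assumptions; longer outages would log more):

```python
DEVICES = 120
REQUESTS_PER_DEVICE = 5
DROPS_PER_DEVICE_PER_DAY = (5, 20)  # low and high figures from the post


def daily_timeouts(devices, requests_per_device, drops_per_day, failed_polls_per_drop=1):
    """Estimate TIMEOUT log entries per day across all devices."""
    return devices * requests_per_device * drops_per_day * failed_polls_per_drop


low = daily_timeouts(DEVICES, REQUESTS_PER_DEVICE, DROPS_PER_DEVICE_PER_DAY[0])
high = daily_timeouts(DEVICES, REQUESTS_PER_DEVICE, DROPS_PER_DEVICE_PER_DAY[1])
print(low, high)  # 3000 12000
```

Doubling the device count doubles both figures, which is why a longer polling period alone only reduces the noise rather than removing it.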

Harikrishna_Patadiya · April 9, 2021, 1:49pm

Here are a few thoughts. I’ve run into Modbus devices that are inherently slow. If you keep polling for data faster than the device can process the requests, its request buffer fills up and it eventually closes the connection, clears the buffer and starts over. That’s when you might be experiencing the outages.

If that is the case, you’ll just have to play with poll rates to find out what is an acceptable rate.
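
A minimal sketch of the reasoning above, treating the device as a simple queue: if requests arrive faster than the device can answer them, the backlog grows at the difference between the two rates until the buffer is full. The rates and buffer size are hypothetical; real devices rarely document these figures.

```python
def seconds_until_full(poll_rate_hz, service_rate_hz, buffer_slots):
    """Seconds until the request buffer fills, or None if the device keeps up."""
    growth_per_second = poll_rate_hz - service_rate_hz
    if growth_per_second <= 0:
        return None  # the device answers at least as fast as it is polled
    return buffer_slots / growth_per_second


# Hypothetical device: polled 10 times per second, able to answer 8 per
# second, with room for 16 queued requests.
print(seconds_until_full(10, 8, 16))  # 8.0
print(seconds_until_full(5, 8, 16))   # None
```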

Additionally, do you have the same Modbus device set up as two (2) Ignition devices? People generally do that to read some of the data with swapped bytes.
If that is the case, you’re also increasing the number of data requests to the end devices.

Another thing, sockets. Are you sure you’re not running out of sockets? And are you sure the devices aren’t running out of sockets?

Finally, are you sure you have at least one tag set up for each device? A Modbus device will typically drop the connection if no requests are made, and Ignition will detect that as a disconnect, so it will reconnect and you’ll keep seeing this dance of disconnecting and reconnecting.

Good luck!