Gateway drops conenctions, stops responding

I am running Ignition 7.5.3-linux-x64 on Ubuntu 12.04.1-server-amd64 kernel 3.2.0-29-generic.

java version “1.6.0_24”
OpenJDK Runtime Environment (IcedTea6 1.11.3) (6b24-1.11.3-1ubuntu0.12.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

I am having two problems, the first causing the second.

First off, Ignition is losing connectivity w/ my Modbus devices. It looks like this is a PLC problem, because when I sniff the network I see the PLC sending a TCP RST. In Ignition logs I get:

82750-INFO   | jvm 3    | 2012/10/28 00:35:29 | ERROR [SocketIODelegate              ] [00:35:29,215]: [hostname=192.168.1.58,port=502] Socket connection closed, DriverState was Connected.
82751-INFO   | jvm 3    | 2012/10/28 00:35:29 | java.net.SocketException: Connection reset
82752-INFO   | jvm 3    | 2012/10/28 00:35:29 |         at java.net.SocketInputStream.read(SocketInputStream.java:185)
82753-INFO   | jvm 3    | 2012/10/28 00:35:29 |         at java.net.SocketInputStream.read(SocketInputStream.java:107)
82754-INFO   | jvm 3    | 2012/10/28 00:35:29 |         at com.inductiveautomation.common.iosession.socket.AsyncSocketIOSession.run(AsyncSocketIOSession.java:92)
82755-INFO   | jvm 3    | 2012/10/28 00:35:29 |         at java.lang.Thread.run(Thread.java:679)

I am directly on the network, no routers or firewalls between. We can blame our equipment for now. The issue is this leads to the Ignition server not responding.

I get

82756-INFO   | jvm 3    | 2012/10/28 00:35:33 | ERROR [ModbusDriver2[PLC12B]         ] [00:35:33,700]: Error during connect.
82757:INFO   | jvm 3    | 2012/10/28 00:35:33 | java.net.SocketException: Too many open files
82758-INFO   | jvm 3    | 2012/10/28 00:35:33 |         at java.net.Socket.createImpl(Socket.java:414)
82759-INFO   | jvm 3    | 2012/10/28 00:35:33 |         at java.net.Socket.connect(Socket.java:544)
82760-INFO   | jvm 3    | 2012/10/28 00:35:33 |         at com.inductiveautomation.xopc.driver.api.SocketIODelegate.connect(SocketIODelegate.java:75)
82761-INFO   | jvm 3    | 2012/10/28 00:35:33 |         at com.inductiveautomation.xopc.driver.api.AbstractIODelegatingDriver.connect(AbstractIODelegatingDriver.java:17)
82762-INFO   | jvm 3    | 2012/10/28 00:35:33 |         at com.inductiveautomation.xopc.driver.api.AbstractDriver$5.run(AbstractDriver.java:481)
82763-INFO   | jvm 3    | 2012/10/28 00:35:33 |         at com.inductiveautomation.ignition.common.execution.impl.BasicExecutionEngine$ThrowableCatchingRunnable.run(BasicExecutionEngine.java:526)
82764-INFO   | jvm 3    | 2012/10/28 00:35:33 |         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
82765-INFO   | jvm 3    | 2012/10/28 00:35:33 |         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
82766-INFO   | jvm 3    | 2012/10/28 00:35:33 |         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
82767-INFO   | jvm 3    | 2012/10/28 00:35:33 |         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165)
82768-INFO   | jvm 3    | 2012/10/28 00:35:33 |         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266)
82769-INFO   | jvm 3    | 2012/10/28 00:35:33 |         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
82770-INFO   | jvm 3    | 2012/10/28 00:35:33 |         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
82771-INFO   | jvm 3    | 2012/10/28 00:35:33 |         at java.lang.Thread.run(Thread.java:679)

in my logs. Checking lsof on the Ignition process I get hundreds of line of:

java    2511 ignition 4065u  sock                0,7      0t0 17076417 can't identify protocol
java    2511 ignition 4066u  sock                0,7      0t0 17076418 can't identify protocol
java    2511 ignition 4067u  sock                0,7      0t0 17076425 can't identify protocol
java    2511 ignition 4068u  sock                0,7      0t0 17076507 can't identify protocol
java    2511 ignition 4069u  sock                0,7      0t0 17076636 can't identify protocol
java    2511 ignition 4070u  sock                0,7      0t0 17076488 can't identify protocol
java    2511 ignition 4071u  sock                0,7      0t0 17076497 can't identify protocol
java    2511 ignition 4072u  sock                0,7      0t0 17076498 can't identify protocol

Any thoughts on why when the connection gets reset does Ignition/Java not clean up?

Perhaps Ignition is not returning the ACK’s to the device or they are getting lost on the way. IIRC the TCP RST is sent when one device has lost track of the TCP session and wants to resynchronize with the other party - most often happening when it isn’t getting ACK’s back.

Any chance you’d want to post the Wireshark dumps so we have some context?

Here is a pcap file I just captured. 192.168.1.4 is an Ignition box, and the other addresses are Modbus slaves.

I had to zip it because the forum did not like the .pcapng extension.

The gateway <-> PLC communication problem exists on both my Linux and Windows 2008 Ignition gateways (testing both, wanting to deploy on Linux), but I have not seen the Windows server give me the “java.net.SocketException: Too many open files” and stop working.

This pcap is from the Windows machine.

Another thing I forgot to mention, the ignition servers are both running on an ESXi 4.1 VMware host.
pcap.zip (4.05 KB)

One thing I found that appears to have resolved the first problem: I was not polling the PLCs, they were just connecting, not reading data. I built 10 tags for each PLC, and once they were being polled they stopped disconnecting.

Since the first problem is resolved, the second is not occurring anymore. It is still odd to me that the dropped connections cause a resource leak (sockets in this case). I am still a little nervous, because over a long time I do not know if this leak will still occur slowly and eventually lead to running out of FDs again… I’ll just have to keep an eye on the count of open FDs. Perhaps an upgrade to openjdk-7-jre would fix things.

Almost all Modbus devices I’ve encountered will close the TCP connection after 10-15 seconds of inactivity.

I’ll take a look at the socket stuff, but I suspect that the issue is that the sockets remain in a TIMEWAIT state for longer than the period at which the Modbus driver connects and then has its connection closed by the PLC due to inactivity, leading to an ever-increasing number of open file handles.