I have a project that reading and writing to a PLC However, when the PLC goes offline, either from maintenance or scheduled downtime, if it’s down for an extended time, we think more that 20 minutes, the drive goes to status of idle and never comes out of it until we disable and re-enable the device. Is this normal behavior and how can I prevent this?
Is the PLC connected with an Ignition OPC-UA driver or an external service like Kepware?
The Igntion OPC-UA driver. It’s the Logix Driver V21+.
We’ve recently just seen this issue too. Ignition 8.0.16, Logix Driver against a v30 firmware L85… Thankfully this little script will “fix” the issue, but I’d not advise putting something like this in as a watchdog if possible:
devices_pydataset = system.dataset.toPyDataSet(system.device.listDevices())
idle_devices = [x['Name'] for x in devices_pydataset if x['State'] == 'Idle']
for device in idle_devices:
system.device.setDeviceEnabled(device, 0)
system.device.setDeviceEnabled(device, 1)
Anyone else come across this? We’ll probably be connecting with support on it, but thought I’d revive the topic…
We have seen this before too and use a very similar fix to ensure comms returns when it can
If this happens regularly enough for you guys is there any chance you can leave the logger for com.digitalpetri.enip.ChannelFsm
on DEBUG level and get the logs next time it happens?
Triple extra bonus points if you can also leave a filtered Wireshark capture running that captures it happening as well.
Checking back in on this issue–we just had it happen again in our Dev environment (also happening in Prod, fwiw). This particular PLC does have the highest tag count of any of the ones we’ve got, though in Dev we’ve only got a couple devices enabled (since we don’t have all the PLCs to connect to here).
It is showing Idle
state, the other Logix PLC we have enabled is showing Connected
and everything with it is functioning as intended. I enabled DEBUG
across all of the com.digitalpetri.enip
loggers (including ChannelFsm
) and oddly enough, no output from those loggers is produced (even when the other device is still chatting along nicely).
I followed up with some tcpdump’s against the gateway–one filtered against the working PLC, another against the Idle
one… I’ve got NOTHING going out or coming in from the Idle
PLC… The other is busily chatting away (20000 packets in just around 10 seconds time)…
I’ll try to setup a rolling capture next to get some of the lead-up to the event for further analysis.
I think a capture + logs taken during the lead up to this happening is going to be key to figuring it out.
Sounds like a plan... Thankfully, when the event occurs, all traffic seems to stop, meaning I won't have to be too quick on the rolling capture. I'll keep everyone posted!
… and in the event someone finds this thread due to tcpdump
, know that if you’re wanting to setup a rolling capture on an Ubuntu host, you will likely need to modify the AppArmor profile (active by default, and present for tcpdump
) to allow .pcap[0-9]
extensions. First sudo apt-get install apparmor-utils
, then modify the /etc/apparmor.d/usr.sbin/tcpdump
file to replace:
# for -r, -F and -w
/**.[pP][cC][aA][pP] rw,
with
# for -r, -F and -w
/**.[pP][cC][aA][pP][0-9]* rw,
Then, reprocess that file by running sudo aa-enforce tcpdump
. The default config only allows writing out of .pcap
files instead of ones with an integer suffix. This will allow a rolling capture, such as:
$ sudo tcpdump -vv -X -C 10 -W 10 -Z kcollins -w clx-traffic.pcap -i veth85914e1 host 10.11.2.42
… to successfully write out .pcap0
, .pcap1
, etc files…
@Kevin.Herron, we observed the situation again and have collected all of the data you had identified…
Here are the log entries:
INFO | jvm 1 | 2020/10/22 08:33:32 | D [c.d.e.ChannelFsm ] [13:33:32]: [7] channel idle, maxIdleSeconds=15
INFO | jvm 1 | 2020/10/22 08:33:32 | D [c.d.e.ChannelFsm ] [13:33:32]: [7] S(Connected) x E(ChannelIdle) = S'(Connected)
INFO | jvm 1 | 2020/10/22 08:33:33 | D [c.d.e.ChannelFsm ] [13:33:33]: [7] exceptionCaught() local=/192.168.0.2:40784, remote=/10.11.2.42:44818
INFO | jvm 1 | 2020/10/22 08:33:33 | java.io.IOException: Connection reset by peer
INFO | jvm 1 | 2020/10/22 08:33:33 | at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
INFO | jvm 1 | 2020/10/22 08:33:33 | at java.base/sun.nio.ch.SocketDispatcher.read(Unknown Source)
INFO | jvm 1 | 2020/10/22 08:33:33 | at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
INFO | jvm 1 | 2020/10/22 08:33:33 | at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
INFO | jvm 1 | 2020/10/22 08:33:33 | at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
INFO | jvm 1 | 2020/10/22 08:33:33 | at java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source)
INFO | jvm 1 | 2020/10/22 08:33:33 | at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
INFO | jvm 1 | 2020/10/22 08:33:33 | at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133)
INFO | jvm 1 | 2020/10/22 08:33:33 | at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
INFO | jvm 1 | 2020/10/22 08:33:33 | at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
INFO | jvm 1 | 2020/10/22 08:33:33 | at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
INFO | jvm 1 | 2020/10/22 08:33:33 | at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
INFO | jvm 1 | 2020/10/22 08:33:33 | at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
INFO | jvm 1 | 2020/10/22 08:33:33 | at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
INFO | jvm 1 | 2020/10/22 08:33:33 | at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
INFO | jvm 1 | 2020/10/22 08:33:33 | at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
INFO | jvm 1 | 2020/10/22 08:33:33 | at java.base/java.lang.Thread.run(Unknown Source)
INFO | jvm 1 | 2020/10/22 08:33:33 | D [c.d.e.ChannelFsm ] [13:33:33]: [7] channelInactive() local=/192.168.0.2:40784, remote=/10.11.2.42:44818
INFO | jvm 1 | 2020/10/22 08:33:33 | D [c.d.e.ChannelFsm ] [13:33:33]: [7] S(Connected) x E(ChannelInactive) = S'(Idle)
Attached is the packet capture (already filtered to/from the PLC IP, 10.11.2.42
) leading up to and including the transition event. I’m going to engage support to have a formal ticket worked up–I’ll reference this thread for their input.
clx02-traffic.pcap13.zip (1.1 MB)
PS: The forum doesn’t allow *.gz
uploads? Sup wit dat?
Can you give me the full Ignition logs leading up to it as well? DM them if you feel they are sensitive.
Mmm, also strange is timestamps in the capture do not line up with the log snippet you posted, which would explain why I can’t find the closing of the TCP connection that we’re seeing there in it either.
yeah, there might be a UTC offset issue… gateway itself is TZ=America/Chicago
I only see 17 seconds of packet capture, ignoring the hour, from x:33:00 to x:33:17.
I do have more historical captures that go further back, this is the latest one (others roll every 10MB). I’ll send you a link to the full set in DM.
Well I’d expect the traffic I’m after to be in a future buffer… unless I’ve got a truncated or corrupt version of the capture. Wireshark does complain that it cut off in the middle of a packet…
Hopefully it isn’t truncated, though on further thought I did copy out the packet captures before terminating tcpdump
… Sure enough, I get the same output with tcpdump -r ...
:
08:33:17.362169 IP (tos 0x6c, ttl 60, id 29463, offset 0, flags [DF], proto TCP (6), length 40)
10.11.2.42.44818 > 192.168.0.2.40784: Flags [.], cksum 0xb845 (correct), ack 2157298, win 7824, length 0
0x0000: 0242 c0a8 0002 0242 641e 14f0 0800 456c .B.....Bd.....El
0x0010: 0028 7317 4000 3c06 fe6d 0a0b 022a c0a8 .(s.@.<..m...*..
0x0020: 0002 af12 9f50 1fb2 6767 c683 701f 5010 .....P..gg..p.P.
0x0030: 1e90 b845 0000 ...E..
tcpdump: pcap_loop: truncated dump file; tried to read 596 captured bytes, only got 594
Sorry about that… Hopefully it is still enough to get some useful insights… It seems to me that at the very least, the driver needs to catch that exception and issue a reconnect.
I’m not sure what’s happening yet. Maybe the capture is correct, and the very last thing that happens is Ignition sends a request and gets a TCP ACK for it. Then it’s radio silence.
15 seconds later in the logs you see the channel idle message. This would have resulted in a keep-alive ListIdentity attempting to be sent. This apparently isn’t happening because the connection is FUBAR at this point. 1 second later we get the exception and the channel FSM goes to Idle state.
All of that is relatively normal though. What’s not normal and what I can’t figure out yet is why nothing in the driver appears to be executing or running any more. What you’d expect soon after this is GetChannel event on the FSM the next time anything tried to get the channel, which would kick off off the reconnect process. It would look something like this:
D [c.d.e.ChannelFsm ] [14:17:20]: [6] exceptionCaught() local=/10.100.70.5:64150, remote=/10.20.4.57:44818
java.io.IOException: Connection reset by peer
at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276)
at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:233)
at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:223)
at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:358)
at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133)
at io.netty.buffer.WrappedByteBuf.writeBytes(WrappedByteBuf.java:821)
at io.netty.buffer.AdvancedLeakAwareByteBuf.writeBytes(AdvancedLeakAwareByteBuf.java:635)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.base/java.lang.Thread.run(Thread.java:834)
D [c.d.e.ChannelFsm ] [14:17:20]: [6] channelInactive() local=/10.100.70.5:64150, remote=/10.20.4.57:44818
D [c.d.e.ChannelFsm ] [14:17:20]: [6] S(Connected) x E(ChannelInactive) = S'(Idle)
D [c.d.e.ChannelFsm ] [14:17:21]: [6] S(Idle) x E(GetChannel) = S'(Reconnecting)
D [c.d.e.ChannelFsm ] [14:17:21]: [6] S(Reconnecting) x E(GetChannel) = S'(Reconnecting)
D [c.d.e.ChannelFsm ] [14:17:21]: [6] S(Reconnecting) x E(ConnectSuccess) = S'(Connected)
But in the case where it’s stuck in Idle it seems everything has just come to a halt…
I’ve seen a similar scenario happen. I have a gateway connected to 3 PLCs for development, and one of them (the one with the most tags being interacted with) goes IDLE occasionally.
After enabling DEBUG on ChannelFsm, I saw this come in.
INFO | jvm 2 | 2020/10/23 15:41:57 | D [c.d.e.ChannelFsm ] [19:41:57]: [8] S(Connecting) x E(ConnectSuccess) = S'(Connected)
INFO | jvm 2 | 2020/10/23 15:42:17 | D [c.d.e.ChannelFsm ] [19:42:17]: [8] channel idle, maxIdleSeconds=15
INFO | jvm 2 | 2020/10/23 15:42:17 | D [c.d.e.ChannelFsm ] [19:42:17]: [8] S(Connected) x E(ChannelIdle) = S'(Connected)
INFO | jvm 2 | 2020/10/23 15:42:17 | D [c.d.e.ChannelFsm ] [19:42:17]: [8] exceptionCaught() local=/10.150.90.223:60622, remote=/10.150.90.208:44818
INFO | jvm 2 | 2020/10/23 15:42:17 | io.netty.handler.codec.DecoderException: java.lang.RuntimeException: unhandled item type: 0x86
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.handler.codec.ByteToMessageCodec.channelRead(ByteToMessageCodec.java:103)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
INFO | jvm 2 | 2020/10/23 15:42:17 | at java.base/java.lang.Thread.run(Unknown Source)
INFO | jvm 2 | 2020/10/23 15:42:17 | Caused by: java.lang.RuntimeException: unhandled item type: 0x86
INFO | jvm 2 | 2020/10/23 15:42:17 | at com.digitalpetri.enip.cpf.CpfItem.decode(CpfItem.java:77)
INFO | jvm 2 | 2020/10/23 15:42:17 | at com.digitalpetri.enip.cpf.CpfPacket.decode(CpfPacket.java:49)
INFO | jvm 2 | 2020/10/23 15:42:17 | at com.digitalpetri.enip.commands.ListIdentity.decode(ListIdentity.java:53)
INFO | jvm 2 | 2020/10/23 15:42:17 | at com.digitalpetri.enip.EnipPacket.decodeCommand(EnipPacket.java:165)
INFO | jvm 2 | 2020/10/23 15:42:17 | at com.digitalpetri.enip.EnipPacket.decode(EnipPacket.java:126)
INFO | jvm 2 | 2020/10/23 15:42:17 | at com.digitalpetri.enip.codec.EnipCodec.decode(EnipCodec.java:30)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.handler.codec.ByteToMessageCodec$1.decode(ByteToMessageCodec.java:42)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:498)
INFO | jvm 2 | 2020/10/23 15:42:17 | at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:437)
INFO | jvm 2 | 2020/10/23 15:42:17 | ... 21 common frames omitted
INFO | jvm 2 | 2020/10/23 15:42:17 | D [c.d.e.ChannelFsm ] [19:42:17]: [8] channelInactive() local=/10.150.90.223:60622, remote=/10.150.90.208:44818
INFO | jvm 2 | 2020/10/23 15:42:17 | D [c.d.e.ChannelFsm ] [19:42:17]: [8] S(Connected) x E(ChannelInactive) = S'(Idle)
INFO | jvm 2 | 2020/10/23 15:42:19 | D [c.d.e.ChannelFsm ] [19:42:19]: [8] S(Idle) x E(KeepAliveFailure) = S'(Idle)
I’ve reset the connection again, and now it’s still running connected so far.