[BUG-15788] Allen Bradley Device Failing in OPC-UA but still showing as running

accyroy · January 13, 2020, 5:03pm

We are running Ignition 8.0.7 with various Allen Bradley PLCs, version 28 and version 32. We have random devices dropping, with tags that go bad. Browsing the OPC-UA server, we see that there are no tags available from that device.

However, the device status still shows it as connected.

We have to manually disable and re-enable the device for it to come back.

Every day a different PLC will fail in this way. I’ve spoken to a few of my colleagues and they have witnessed this behaviour at other plants, but usually only after a server restart. We have not restarted our server for over a week, but every day we lose a device.

I’ll call support this afternoon, but just wondered if anyone else has experienced this.

Kevin.Herron · January 13, 2020, 5:07pm

Haven’t seen anything like this. Find the loggers for LogixBrowseState and LogixBrowseStateManager next time it happens and turn those to TRACE.

accyroy · January 13, 2020, 5:21pm

Good timing, as one PLC just went down.

Found this, and this time it self recovered after about 10 minutes. The PLC was up during this time and the driver said it was Connected.

Kevin.Herron · January 13, 2020, 5:24pm

Well, it was connected. The problem is for some reason the PLC isn’t responding to the browse request.

Browsing uses a form of communication that does not “reserve” resources on the PLC the way the tag reads and writes do (CIP connected vs unconnected messaging), so if the PLC is overloaded in some way this could be responsible for the browse requests being ignored.

I’m not sure why restarting a device connection would have any effect other than that maybe it gives the PLC a slight break in the requests from us (and a new logical TCP connection, maybe there’s some kind of bug in the firmware there).

accyroy · January 13, 2020, 5:30pm

I’m going to start monitoring this via alarming. This PLC is really not overloaded at all. It also happens to any PLC here at random times, so it seems to be either a network issue, which isn’t likely since I can connect via RSLinx while the device is down, or maybe something in the AB driver. We are seeing this on other sites as well.

Kevin.Herron · January 13, 2020, 5:32pm

It might help if you can start a Wireshark capture on the server running the Ignition gateway next time it’s happening as well.

Phil_B · January 13, 2020, 6:27pm

I am at one of the other sites @accyroy mentioned, and we are seeing the same thing. Today it was 2 different PLCs using the Allen-Bradley Logix Driver. Both PLCs have about 50 tags set up in Ignition. When troubleshooting, I noticed the device was listed as “Connected”, but when I browsed to it in the OPC browser and tried to view its tags, there was nothing. To solve the issue I went into the device configuration, changed nothing, and hit save. The device then reinitialized and all tags went back to good quality.

accyroy · January 13, 2020, 7:07pm

Phil are you on 8.0.7 as well?

Phil_B · January 13, 2020, 7:18pm

7.9.12

accyroy · January 13, 2020, 8:55pm

I just had to reboot the server, and when it came back up two of the PLCs were in this state.

Fortunately one of the PLCs isn’t a critical piece of equipment, so I don’t need to get it back online immediately. The other one I disabled and re-enabled, and it is now fine again. I have done a Wireshark capture, attached. I’m currently waiting for a callback from support and will update if we find anything interesting.

Kevin.Herron · January 13, 2020, 9:04pm

The only interesting thing I see in this capture is that you’ve got 2 pieces of software on the same machine talking to the PLC. One of them is definitely not Ignition.

The traffic from the connection that looks like Ignition looks fine during the capture - it’s just the periodic requests that happen every 5 seconds. Do you know for sure that one of those LogixBrowse warnings happened while this Wireshark capture was recording and that the device is the one in the capture?

accyroy · January 13, 2020, 9:07pm

Yes, it was down when I took this capture and it is still down now. The other software talking to the PLC is RSLinx, which can see and talk to the PLC fine. The device is connected to the network and still not showing any tags in the OPC-UA browser.

accyroy · January 13, 2020, 9:11pm

Just re-read your question: the LogixBrowse warning did not happen during the capture. We get that warning, and then the comms are gone. The capture was taken while the device had no tags in the OPC-UA browser.

Kevin.Herron · January 13, 2020, 9:15pm

Ah, I need a capture when the timeout event happens, if possible. I need to try to match a timeout and its senderContext to a packet in Wireshark and see if the PLC isn’t responding or if something else is going on. Do you still have the LogixBrowse loggers on? Do the warnings keep happening while the device has no tags?
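
For anyone trying the same matching on their own capture: every EtherNet/IP encapsulation message begins with a 24-byte header, and bytes 12 to 19 carry an 8-byte sender context that the PLC echoes back in its reply. A rough sketch of finding requests that never got a reply is below. It assumes the TCP payloads have already been exported from Wireshark as raw bytes, one encapsulation message each; the function names are made up for illustration and are not part of Ignition, and how the Ignition log formats the senderContext value is not shown in this thread.

```python
import struct

# EtherNet/IP encapsulation header, little-endian, 24 bytes:
# command, length, session handle, status, 8-byte sender context, options.
ENCAP_HEADER = struct.Struct("<HHII8sI")


def sender_context(payload):
    """Return the 8-byte sender context from one encapsulation message."""
    if len(payload) < ENCAP_HEADER.size:
        raise ValueError("shorter than the 24-byte encapsulation header")
    _cmd, _length, _session, _status, context, _options = ENCAP_HEADER.unpack_from(payload)
    return context


def unanswered(requests, replies):
    """Return the sender contexts of requests with no matching reply.

    `requests` and `replies` are lists of raw payloads (hypothetical inputs,
    e.g. split by direction when exporting from the capture).
    """
    answered = {sender_context(p) for p in replies}
    return [c for c in map(sender_context, requests) if c not in answered]
```

Any context returned by `unanswered` identifies a request the PLC did not respond to within the capture window, which is the case being looked for here.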

accyroy · January 13, 2020, 9:26pm

I’m running Wireshark now, I’ll upload another capture if we lose another PLC.
I still have the LogixBrowse loggers on, but there are no more entries concerning the device with no tags. It’s as though the driver has timed out and isn’t trying again.

accyroy · January 13, 2020, 10:53pm

So restarting the gateway forces this to happen to several random PLCs. I restarted whilst capturing the packets, and three PLCs failed on restart. Attached are logs with correct timestamps; see the images to tally times, faults and captures.

Kevin.Herron · January 13, 2020, 11:16pm

Thanks, I think these captures have what I need. I can see the last browse request Ignition sends before the PLC stops responding and then no more browse requests after that - just the status requests.

I’ll have to dig into this over the next couple days and see if I can reproduce it here somehow.

accyroy · January 13, 2020, 11:21pm

Thanks Kevin. Just FYI, two of those PLCs in the capture are version 28.012 and the other one is 32.011.
If you’d like to connect to our server let me know. I’m on site here for the rest of this week.

Kevin.Herron · January 13, 2020, 11:40pm

Hmm, so I found the change that introduced what you’re seeing.

A little while ago we noticed that the program in these new Logix processors can get into an inconsistent state where it will tell us a symbol or template or member exists, but then when a template read (part of the browse process) is attempted, it fails saying it doesn’t exist.

In response to this we made it so that failure while browsing the global scope or any individual program doesn’t fail the whole browse process.

Unfortunately what you’re seeing here is the browse of the global scope failing due to a timeout and then us carrying on anyway as if it were okay.

So I either need to exclude global scope from this logic, exclude failures due to timeout from this logic, or probably both.
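
To make the failure mode concrete, here is a rough sketch of the "probably both" option described above. It is not the driver's actual code: `browse_device`, `browse_scope`, `BrowseError`, `BrowseTimeout` and `GLOBAL_SCOPE` are all hypothetical names chosen for illustration. Only an inconsistent-state error inside an individual program is tolerated; a timeout anywhere, or any failure of the global scope, is allowed to fail the whole browse so that it can be retried rather than treated as a successful browse with no tags.

```python
class BrowseTimeout(Exception):
    """Hypothetical: the PLC did not answer a browse request in time."""


class BrowseError(Exception):
    """Hypothetical: the PLC answered inconsistently (e.g. a missing template)."""


GLOBAL_SCOPE = "<global>"  # hypothetical key for controller-scoped tags


def browse_device(programs, browse_scope):
    """Browse the global scope, then each program.

    `browse_scope(name)` is a caller-supplied function returning the tags
    found in that scope. Failures of the global scope and timeouts in any
    scope propagate to the caller; only a BrowseError in an individual
    program is swallowed.
    """
    # Not wrapped in try/except: a global-scope failure fails the browse.
    result = {GLOBAL_SCOPE: browse_scope(GLOBAL_SCOPE)}
    for program in programs:
        try:
            result[program] = browse_scope(program)
        except BrowseError:
            # Skip the inconsistent program but keep everything else.
            result[program] = []
    return result
```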

accyroy · January 13, 2020, 11:51pm

Thanks for confirming. I’ll write a script using system.device.setDeviceEnabled for now to restart the device on bad quality.
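
A minimal sketch of such a workaround script, under some assumptions: the device name and tag path are placeholders, one "canary" tag per device is used to judge quality, and the Ignition calls are passed in as arguments so the logic can be tried outside a gateway. Only `system.device.setDeviceEnabled` is named in this thread; the tag-read wiring in the comments is an assumption about Ignition 8.0 scripting and would differ on 7.9.

```python
def restart_if_bad(device_name, canary_is_good, set_device_enabled, sleep, pause_s=2.0):
    """Disable and re-enable a device when its canary tag is not good.

    device_name        -- name of the device connection (placeholder)
    canary_is_good     -- zero-argument function, True when the tag quality is good
    set_device_enabled -- function(name, enabled), e.g. system.device.setDeviceEnabled
    sleep              -- function(seconds), e.g. time.sleep
    Returns True if a restart was issued, False otherwise.
    """
    if canary_is_good():
        return False
    set_device_enabled(device_name, False)
    sleep(pause_s)  # brief pause before re-enabling; the length is a guess
    set_device_enabled(device_name, True)
    return True


# Possible wiring inside a gateway timer script (assumed, untested here):
#   import time
#   path = "[default]PLC1/Heartbeat"   # hypothetical tag path
#   good = lambda: system.tag.readBlocking([path])[0].quality.isGood()
#   restart_if_bad("PLC1", good, system.device.setDeviceEnabled, time.sleep)
```

Running this on a slow timer, rather than on every tag change, avoids restarting the device repeatedly while it is still reinitializing.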