7.5.10 upgrade issues

dfenter · August 21, 2013, 3:26pm

I have upgraded to 7.5.10, and I keep getting the same error:

xopc.client.stack.UaClient.PublishRequestPump

StatusCode[Severity=Bad, Subcode=Bad_InternalError]: ServiceFault: StatusCode[Severity=Bad, Subcode=Bad_InternalError]
at com.inductiveautomation.xopc.client.stack.TCPClientChannel.validateResponseType(TCPClientChannel.java:833)
at com.inductiveautomation.xopc.client.stack.TCPClientChannel.receiveMessage(TCPClientChannel.java:794)
at com.inductiveautomation.xopc.common.stack.UAChannel$1DeliverMessage.deliver(UAChannel.java:967)
at com.inductiveautomation.xopc.common.stack.UAChannel$DeliverToDelegate.run(UAChannel.java:1592)
at com.inductiveautomation.xopc.client.stack.SerialExecutionQueue$RunnableExecutor.execute(SerialExecutionQueue.java:84)
at com.inductiveautomation.xopc.client.stack.SerialExecutionQueue$RunnableExecutor.execute(SerialExecutionQueue.java:81)
at com.inductiveautomation.xopc.client.stack.SerialExecutionQueue$PollAndExecute.run(SerialExecutionQueue.java:59)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

Kevin.Herron · August 21, 2013, 3:30pm

Can you attach or send to support the logs.bin.gz file exported from the Console area of the gateway? There’s not enough context to figure out what’s going on here from the stack trace alone.

dfenter · August 21, 2013, 4:10pm

Here are the logs

Kevin.Herron · August 21, 2013, 4:21pm

Oooph. You’ve got a bunch of pretty bad things going on.

It looks like you’ve got a number of remote UA connections, some of which are not responding in time to read requests for the server’s current timestamp (this is just a sanity check and should be basically instantaneous…), which causes the connection to be reset.

Additionally, some of your remote servers are not responding with keep-alive or data publish responses within the timeout established upon connection, which is leading to timeouts on the publish requests as well.

Right now I’m not really sure how to address any of this. The connection(s) must be of really poor quality for this to be occurring, as I’ve seen plenty of remote UA connections before but have not seen anything like this happening.

dfenter · August 21, 2013, 4:41pm

I have been resetting the OPC servers to clear the errors.

It is only affecting the sites with PCS7 systems.

Dan

Kevin.Herron · August 21, 2013, 4:42pm

[quote=“dfenter”]
It is only affecting the sites with PCS7 systems.
Dan[/quote]

What are PCS7 systems?

dfenter · August 21, 2013, 4:57pm

As for the quality of the OPC connections, all our sites are on T1 or fiber networks, and they are running the KEPServerEX program. Tags seem to be all showing good quality after reinitializing the OPC servers at the sites.

Dan

dfenter · August 21, 2013, 4:58pm

Siemens PCS7 control systems…

Kevin.Herron · August 21, 2013, 5:00pm

Maybe the Kepware servers at the other end of these problematic connections are overloaded or something. Do you know what Kepware version they’re all running?

I can try giving you a custom build that has exaggerated values for the read request and publishing timeouts to see if it alleviates the problem, but it’s not a viable long-term fix. If the connection is good there’s really no excuse for the server to fail to respond within the current timeouts, as they’re already pretty generous.

dfenter · August 21, 2013, 6:35pm

Both sites had power outages, and when they came back online we started getting the errors.

dfenter · August 21, 2013, 6:37pm

It seems like every time I upgrade the software, a new problem pops up.

We may just downgrade to the last version that was working.

dfenter · August 21, 2013, 6:46pm

How can I tell which server is causing the issues? It is not indicated in the log viewer.

Kevin.Herron · August 21, 2013, 7:09pm

Irwindale_Connection and DeLisle_Connection both experience timeouts.

It looks like Lakeland and KEP_Connection both had brief/intermittent issues, but self-corrected.

I need to get these log messages to include the connection name somehow… very difficult to tell what’s going on.
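As a rough illustration of what tagging log messages with the connection name could look like, here is a minimal Python sketch using the standard `logging` module. It is not how the UA module logs internally; the logger name `opcua.client` and the helpers `ConnectionAdapter` and `logger_for` are made up for this example.

```python
import io
import logging


class ConnectionAdapter(logging.LoggerAdapter):
    """Prefix every message with the connection name (hypothetical helper)."""

    def process(self, msg, kwargs):
        return "[%s] %s" % (self.extra["connection"], msg), kwargs


def logger_for(connection_name, base=None):
    """Return a logger whose messages carry the given connection name."""
    if base is None:
        base = logging.getLogger("opcua.client")  # assumed logger name
    return ConnectionAdapter(base, {"connection": connection_name})


# Capture output in memory so the result can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
base_logger = logging.getLogger("opcua.client")
base_logger.addHandler(handler)
base_logger.setLevel(logging.INFO)
base_logger.propagate = False

log = logger_for("Irwindale_Connection")
log.warning("Reading server time and state timed out, disconnecting...")
output = buffer.getvalue()
```

With one adapter per connection, a timeout message can be traced to Irwindale_Connection or DeLisle_Connection directly from the log line.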

dfenter · August 22, 2013, 12:57pm

Is there a way you can disable the function that makes the OPC server disconnect if the timestamp check fails?

Specifically this function: Reading server time and state timed out, disconnecting…

I really don’t care about the timestamp check, and I don’t see why it would cause the server to disconnect…

Every new release it seems like there is a new error that deals with the timestamp check. Just disable it.

Kevin.Herron · August 22, 2013, 1:27pm

[quote=“dfenter”]Is there a way you can disable the function that makes the OPC server disconnect if the timestamp check fails?

Specifically this function: Reading server time and state timed out, disconnecting…

I really don’t care about the timestamp check, and I don’t see why it would cause the server to disconnect…

Every new release it seems like there is a new error that deals with the timestamp check. Just disable it.[/quote]

It has nothing to do with the value returned, it’s just picking two nodes that the OPC-UA spec says will be there in every server and reading them. At the request of you and others, there is no longer any logic that validates the sanity of timestamps on received values.

The problem isn’t that the time returned is wrong. It’s that the act of reading two values that should just be readily available in memory timed out. Which means nothing else is probably working (well, timely, or at all) either. It’s also verifying that the current session is valid, and more importantly, keeping the session alive in cases where people have very slow, or worse, no subscriptions (but still periodically read and write to the server).
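To make the idea concrete, here is a minimal Python sketch of a read-with-timeout health check. It is only an illustration, not the module's actual code: `check_connection`, `fast_read`, and `slow_read` are hypothetical names, and the reads are stand-ins for reading the server's time and state nodes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def check_connection(read_nodes, timeout_s):
    """Return True if read_nodes() finishes within timeout_s seconds.

    The values returned by read_nodes() are ignored; only whether the
    read completes in time matters.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(read_nodes)
    try:
        future.result(timeout=timeout_s)
        return True
    except FutureTimeout:
        return False  # the caller would reset the connection here
    finally:
        pool.shutdown(wait=False)


def fast_read():
    # Stand-in for a healthy server answering from memory.
    return ("2013-08-21T15:26:00Z", "Running")


def slow_read():
    # Stand-in for an overloaded server that answers late.
    time.sleep(0.3)
    return ("2013-08-21T15:26:00Z", "Running")


healthy = check_connection(fast_read, timeout_s=0.2)
overloaded = check_connection(slow_read, timeout_s=0.05)
# A larger timeout lets the slow server pass, but it is still slow.
tolerated = check_connection(slow_read, timeout_s=1.0)
```

Note that the check never looks at the returned timestamp. Raising the timeout only hides a slow server; it does not make it respond faster.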

I understand that it’s frustrating when things break. It sucks that you have to deal with that. But you’ve built a complicated system with a lot of moving pieces and if you change or upgrade a major component there will be fallout. 7.5.10 is by far the most stable release in the 7.5.x series so downgrading is not recommended. Plus you’ll just go back to the old problem with timestamps being off by too much, which is why you upgraded in the first place. Finding and fixing the root cause of the current issue is the best path forward.

Pretending it doesn’t exist by simply removing this built-in sanity check would only lead to other problems caused by whatever the underlying problem with these remote servers is.

dfenter · August 22, 2013, 1:44pm

So what is the root cause of the error? How can I fix it? Reinitializing the OPC servers every hour is not sustainable.

I have verified that the values we are reading in Ignition match what we are seeing on the local control system. That is all that I really care about.

Sporadically, the tags will return nulls, and when I click the diagnostics, it says not connected.

dfenter · August 22, 2013, 1:47pm

This only seems to be affecting the sites that are running Siemens PCS7 control systems.

Do I need to change the settings of the KEPServerEX program?

Kevin.Herron · August 22, 2013, 1:53pm

When I get into the office today I’ll build you a custom UA module that has ~2x the allowed timeout for read sanity check as well as an increased timeout on the publish requests. My hope is that these servers are periodically just extremely overloaded or something and that the increased timeouts will allow whatever is causing that to pass. If this doesn’t work we’ll go from there…

What UA servers are running at these remote locations that are having problems? Are they part of the Siemens PCS7 system you mentioned, are they Kepware, or are they remote Ignition OPC-UA servers? It would also be nice to get an idea of how many tags you’re subscribed to from each of the servers and what the rates of those subscriptions are.

The only time I’ve seen something similar to this was with another customer using Kepware, although the connections weren’t remote. They were periodically doing large batches of writes to tags in the server which for whatever reason would overload the server/driver on occasion. It turns out they were running a pretty old version of Kepware, something like 5.5 or 5.6, and after upgrading to the current 5.12 version the problems disappeared.

dfenter · August 22, 2013, 3:46pm

The remote location we are having the most trouble with is pulling 500 tags at 2000ms, using KEPserverEX 5.3 with the OPC DA license. I am in the process of upgrading the Kepserver program to see if that resolves the issue.

I have other locations that are pulling 7000 tags without any issues, but they are using KEPserver 5.10.
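One thing that is easy to misread here: going by the versions mentioned in this thread, 5.3 is older than 5.10 and 5.12, because the parts are compared as whole numbers rather than as a decimal. A small Python sketch of that comparison (`parse_version` is a hypothetical helper, not part of any Kepware or Ignition API):

```python
def parse_version(text):
    """Split '5.10' into (5, 10) so each part compares numerically."""
    return tuple(int(part) for part in text.split("."))


versions = ["5.10", "5.3", "5.12"]
ordered = sorted(versions, key=parse_version)  # oldest first

# Treating the version as a decimal number gives the wrong answer.
decimal_says_newer = float("5.3") > float("5.10")
tuple_says_older = parse_version("5.3") < parse_version("5.10")
```

Here `ordered` puts 5.3 first and 5.12 last, which matches the site with the oldest install being the one with the most trouble.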

dfenter · August 26, 2013, 3:49pm

Upgrading to KepserverEX 5.12 cleared the errors.

Dan