NTP Issues and failing devices

Andrew_P · June 23, 2019, 10:36am

Hi,

I have a rather difficult and urgent problem. I am syncing an Ignition gateway to an NTP server. If the NTP time is too far from the currently set time, some of the devices fail indefinitely with no indication why. What’s worse is that it is a high availability setup, and because the entire gateway doesn’t fail, it does not fail over to the other server. However, forcing the machine to shut down and fail over STILL doesn’t fix the issue. I have to manually turn off the OPC server connection and turn it back on, which is not acceptable for the customer to have to do. And they would have to identify the problem first too, which is another issue.

Any ideas on what to do would be appreciated, I’m stumped. We are nearing the end of commissioning and I don’t want to deliver a high availability system which is extremely fragile.

P.S. Worth noting: if the time sync moves the clock into the future, there is no problem. If it moves the clock into the past, it fails.

Also worth noting is that the clock source is a GPS which syncs a PLC, which is the time server. The PLC was not compatible with the Windows Time service, so we are running a third-party utility called NetTime. I have yet to prove whether using the built-in Windows (Windows 10) time sync service would resolve this issue.

Thanks

Andrew

Kevin.Herron · June 23, 2019, 12:31pm

What version of Ignition is this?

What drivers are failing? What does “failure” look like or mean?

Is the OPC connection faulting? Does edit/saving a failed device instead of the entire OPC connection “fix” it?

Andrew_P · June 23, 2019, 6:45pm

Hi Kevin,

Thanks for the response, sorry for the lack of detail, was in a bit of a rush.

Ignition version is 7.9.8. The drivers being used are Modbus/TCP drivers. I have about 20 or so devices, but they all connect to the same PLC in order to get the throughput required (Ignition support endorsed this as a suitable workaround for a throughput problem I had with the driver and a single device). Only some devices fail. All I can tell about the failure at the moment is that all tags on the device go to bad quality. The device still says it is connected.

Resetting the specific device connection does work, but with 20 devices I needed a quick way during testing to cover all of them without trying to identify which ones had actually failed, so I reset the OPC connection (edit/disable/save, then edit/enable/save). The OPC connection does not report that it has failed either.
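
For anyone wanting to script the per-device reset rather than click through it, here is a rough sketch of the disable/enable cycle. It is a stopgap idea, not a fix for the underlying problem. The device names are made up, and the `set_enabled` callable is injected so the logic can be tried outside the gateway; inside an Ignition gateway script it is assumed (not verified against 7.9.8) that `system.device.setDeviceEnabled` could be passed in.

```python
import time


def bounce_devices(device_names, set_enabled, pause_s=2.0, sleep=time.sleep):
    """Disable, pause, then re-enable each named device connection.

    set_enabled(name, enabled) is an injected callable. In a gateway script
    it is ASSUMED to be system.device.setDeviceEnabled; that is not confirmed
    for this Ignition version.
    """
    bounced = []
    for name in device_names:
        set_enabled(name, False)   # equivalent of edit/disable/save
        sleep(pause_s)             # give the driver time to drop the connection
        set_enabled(name, True)    # equivalent of edit/enable/save
        bounced.append(name)
    return bounced


if __name__ == "__main__":
    # Dry run with a fake setter and hypothetical device names.
    calls = []
    bounce_devices(["Modbus_01", "Modbus_02"],
                   set_enabled=lambda name, enabled: calls.append((name, enabled)),
                   pause_s=0, sleep=lambda seconds: None)
    print(calls)
```

Bouncing devices one at a time keeps the other connections up, unlike resetting the whole OPC connection.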

Hope this helps, let me know if you need more information.

Thanks for your help,

Andrew

Kevin.Herron · June 23, 2019, 7:25pm

When this happens are the logs silent or do you start getting error messages or timeouts for those devices?

How big are the backwards time jumps?

Andrew_P · June 23, 2019, 8:05pm

I did have a quick look at the logs but couldn’t see anything I thought was related. I might look again, as I only skimmed.

Also, initially the time jumps were huge (hours) because we were manually setting place time to an arbitrary time to test. We then tried to take the gap down to as short as we practically could (around 1 minute), but it was hard to control the gap exactly. I would have thought 1 minute is a possible drift in clocks if for some reason the clock source was down for a while. Any ideas on how much time deviation it can actually tolerate, if a large difference in time is the issue? Also, although these time changes broke redundancy, we also tried time changes with the backup server already down, so there were no issues caused by transfer to a backup server with a different time.
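
Since the gap was hard to control exactly, one way to measure the actual size of each step would be a small watcher running on the gateway PC that compares the wall clock against a monotonic clock. This is only a sketch in standard Python 3 (run outside Ignition); the threshold, interval, and function names are arbitrary choices for illustration.

```python
import time


def clock_step(prev_wall, prev_mono, wall, mono):
    """Return how far the wall clock moved beyond the elapsed monotonic time.

    A negative result means the wall clock was set backwards by roughly
    that many seconds; a positive result means it was set forwards.
    """
    return (wall - prev_wall) - (mono - prev_mono)


def watch(threshold_s=0.5, interval_s=1.0, samples=5, log=print):
    """Sample both clocks and log any step larger than threshold_s."""
    prev_wall, prev_mono = time.time(), time.monotonic()
    for _ in range(samples):
        time.sleep(interval_s)
        wall, mono = time.time(), time.monotonic()
        step = clock_step(prev_wall, prev_mono, wall, mono)
        if abs(step) > threshold_s:
            log("clock stepped by %+.3f s at %s" % (step, time.ctime(wall)))
        prev_wall, prev_mono = wall, mono


if __name__ == "__main__":
    watch(samples=3)
```

Logging the step size next to the time the tags go bad would make it easier to tell whether there is a threshold below which the devices survive.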

Kevin.Herron · June 23, 2019, 10:16pm

Can you reproduce this fairly easily using that utility?

If you can, or next time it happens, figure out the name of one of the devices that has bad tags. Then, in the gateway logger area, search for “DriverVariableNode”, set the log level of the logger that comes up to DEBUG, and see if there are any messages about setting stale values.

pturmel · June 24, 2019, 1:40am

Also consider moving to an operating system that supports standard NTP operation – adjusting the OS RTC frequency to converge the local clock to the network clock instead of jumping, with frequency tweaks (not jumps) every few minutes. Hint: Not Windows.
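
To illustrate the difference between jumping (stepping) and converging (slewing), here is a toy model. It is not the real NTP discipline algorithm; the 500 ppm rate cap and the function names are assumptions picked for the example.

```python
def step_correction(local, offset):
    """Jump straight to the reference time; local time can go backwards."""
    return local + offset


def slew_readings(offset, ticks, max_slew_ppm=500.0, tick_s=1.0):
    """Return successive local clock readings while slewing away `offset`.

    offset is (reference - local) in seconds. The clock rate is changed by
    at most max_slew_ppm, so every tick still advances the clock and the
    readings never go backwards.
    """
    max_adj = tick_s * max_slew_ppm / 1e6  # largest correction per tick
    local = 0.0
    remaining = offset
    readings = []
    for _ in range(ticks):
        adj = max(-max_adj, min(max_adj, remaining))
        local += tick_s + adj
        remaining -= adj
        readings.append(local)
    return readings


if __name__ == "__main__":
    print(step_correction(100.0, -60.0))   # stepping: the clock lands in the past
    print(slew_readings(-0.001, ticks=5))  # slewing: readings keep increasing
```

The trade-off is convergence time: at the assumed 500 ppm cap, absorbing a 60-second offset purely by slewing would take 120,000 seconds, roughly 33 hours.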

Andrew.Zebic · June 24, 2019, 3:58am

Hi Kevin,
I’m Andrew’s coworker. There are no log entries found for any time period when searching for DriverVariableNode with the minimum level set to DEBUG.
Regards,
Andrew

Kevin.Herron · June 24, 2019, 3:14pm

You won’t find any entries retroactively. The log level needs to be changed to DEBUG (or left there) once a backwards time change has occurred and the bad quality tags start happening.

ingraham · June 28, 2019, 3:32pm

Out of curiosity, what is the PLC?

Does the failure happen when the PLC's clock updates or when the PC's clock updates?