Troubleshooting clock drift

rbachman · November 4, 2021, 8:39pm

We’re experiencing some Clock Drift errors, and I’m pretty confident they are related to JVM Garbage Collection. The errors nearly always coincide with what look like GC runs.

I’m new to JVM GC, so I’m trying to figure out the best way to address this. We currently have 1GB for init and 16GB for max. We are currently running 8.1 at all of our sites.

Here are a couple of options we’ve been considering:

Lower max heap size so GC has less data to process.
Increase init heap size so the gap between init and max is smaller. I’m wondering if GC reduces memory back to the init level.

Does anyone have any thoughts on this?
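For anyone wanting to check the "does GC shrink the heap back to init" question empirically, here is a minimal sketch using the standard java.lang.management API. It prints the initial, used, committed, and maximum heap sizes of the JVM it runs in. The class name HeapReport is made up for illustration, and run standalone it reports on its own JVM rather than the gateway's; reading the same values from a running gateway (for example over JMX) is an assumption about your setup. Whether committed memory ever drops back toward init depends on the collector and its settings.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: prints the heap figures of the JVM it runs in.
final class HeapReport {
    /** Formats a byte count as whole mebibytes; negative (undefined) values become "n/a". */
    static String mib(long bytes) {
        return bytes < 0 ? "n/a" : (bytes / (1024 * 1024)) + " MiB";
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("init      = " + mib(heap.getInit()));      // initial heap request
        System.out.println("used      = " + mib(heap.getUsed()));      // live + not-yet-collected objects
        System.out.println("committed = " + mib(heap.getCommitted())); // memory currently reserved from the OS
        System.out.println("max       = " + mib(heap.getMax()));       // upper limit for the heap
    }
}
```

Sampling "committed" over time and comparing it with "init" shows whether the heap is being handed back after collections.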

Kevin.Herron · November 4, 2021, 8:42pm

You could try lowering the max and/or adding -XX:MaxGCPauseMillis=100 to your ignition.conf.

Or just do nothing, because it doesn’t seem that bad.

edit: ah, also, use real hardware instead of virtualized hardware if that’s not already happening.

pturmel · November 5, 2021, 1:28am

This first.

rbachman · November 5, 2021, 1:59pm

Thanks! I think we’ll try putting this in place this weekend.

rbachman · November 5, 2021, 2:10pm

It’s not that bad right now, but we’ve had issues recently because of Clock Drift errors. We had a few situations where Ignition missed tag changes on the PLC during the Clock Drift error.

We are running Ignition in a VM, so we went from 4 to 12 cores, and that helped immensely.

Kevin.Herron · November 5, 2021, 2:15pm

When you said it missed a value, do you mean:

E.g. the value went from 1 to 2 and you never saw 2
E.g. the value went from 1 to 2 to 3 and you saw 1 and 3 but missed 2

rbachman · November 9, 2021, 10:04pm

There is a BOOL tag on the PLC that went from high to low and back to high, but Ignition missed the change to low. As a result, Ignition did not fire a gateway event, and our production line stopped.

So it would be more similar to example 2 that you provided.

Kevin.Herron · November 9, 2021, 10:11pm

Okay, that is the kind of thing a GC pause could cause you to miss. It’s also something you could miss anyway if the change happened too fast and you got unlucky with the polling times. Without a handshake of some sort to reset you’ll always technically run that risk, even if you are generous with your timing.

rbachman · November 10, 2021, 3:57pm

It shouldn’t be an issue with polling times. The tag is normally low for at least 2-3 seconds, and we’re polling at either 250 or 500ms. In this particular case, we found a Clock Drift that lasted about 5 seconds at the exact time when we think the tag would have turned FALSE.
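The numbers above can be put into a small simulation. This is a simplified sketch, not a model of Ignition's tag polling: it assumes polls that fall inside a stall are simply skipped, and the start times (low at 10 s, stall at 9.8 s) are made up for illustration. Only the 500 ms poll rate, the roughly 2.5 s low period, and the 5 s stall come from the description above.

```java
// Hypothetical simulation of fixed-rate polling with one stall window.
final class PollSim {
    /** Returns true if at least one poll that is not inside the stall lands while the signal is low. */
    static boolean sawLow(long pollMs, long lowStartMs, long lowMs,
                          long stallStartMs, long stallMs, long horizonMs) {
        for (long t = 0; t <= horizonMs; t += pollMs) {
            boolean stalled = t >= stallStartMs && t < stallStartMs + stallMs;
            boolean low = t >= lowStartMs && t < lowStartMs + lowMs;
            if (!stalled && low) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // No stall: the 2.5 s low period is sampled several times at 500 ms.
        System.out.println("no stall:  " + sawLow(500, 10_000, 2_500, 0, 0, 20_000));
        // A 5 s stall covering the whole low period: every poll that could have seen it is lost.
        System.out.println("5 s stall: " + sawLow(500, 10_000, 2_500, 9_800, 5_000, 20_000));
    }
}
```

With generous timing the low state is only missed when the stall covers the entire low period, which matches the single 5-second Clock Drift event described here.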

I double checked the logic at this particular station, and the handshake isn’t set up quite the way I would like. It looks like we’re triggering off of a part present tag, so the problem here was that Ignition didn’t see when one part left the station and the next part entered the station. I may work with our controls team to see if we can arrange for better handshake logic.

SKA · November 11, 2021, 7:52am

Not to be that person… but here it is anyways.
Why is the PLC reliant on a SCADA/HMI for it to function? No production-breaking data should be relied on in this manner, IMO. All relevant data should be in the PLC before start, or the line has to be able to pause and wait for the data to arrive, so that the stop/pause doesn’t break everything and bring the line to a screeching halt.

rbachman · November 11, 2021, 1:28pm

The short answer is that we use Ignition for MES, so we’re using it for a whole lot more than SCADA/HMIs. For this particular plant we rely on it for historical data for making decisions and confirming data collection. We scan barcodes on parts prior to them entering machines. Ignition looks up historical data related to the part to deliver back to the PLC. So if, for example, a part fails a particular quality test at machine “A” in the process, it is possible that it is unnecessary to test the part in machine “B”. Ignition will tell machine “B” that the part has already been rejected, and machine “B” will skip testing the part, which reduces cycle time and wear on the machine.

We also have very strict and specific data collection requirements with our customer, so we can’t ship parts to our customer without capturing specific data. The PLC will hold a part in the machine and wait for confirmation from Ignition that the data has been stored before releasing the part to the next station.

bschroeder · November 11, 2021, 1:31pm

If that is the case, then I would suggest not using a bit that changes state to flag Ignition that it needs to do something. Look at using an integer that changes value, and if Ignition hasn’t responded in a reasonable amount of time, change that integer again. If you are dealing with OEM equipment, I know some of that will be difficult to do.
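A rough sketch of why the integer approach is more robust than a BOOL: a bit that goes high, low, high between two polls looks unchanged, while a counter that was bumped looks different no matter when it is sampled. The class and method names below are hypothetical, and the write-back of an acknowledgement to the PLC (and the PLC-side retry) is left out.

```java
// Hypothetical gateway-side view of a counter-based handshake.
final class SequenceHandshake {
    private int lastHandled;

    SequenceHandshake(int initialCounter) {
        this.lastHandled = initialCounter;
    }

    /**
     * Called on every poll with the counter value read from the PLC.
     * Returns true when the value differs from the last one handled,
     * i.e. there is a request that has not been processed yet.
     */
    boolean poll(int plcCounter) {
        if (plcCounter == lastHandled) {
            return false;
        }
        lastHandled = plcCounter;
        return true;
    }

    public static void main(String[] args) {
        SequenceHandshake hs = new SequenceHandshake(7);
        System.out.println(hs.poll(7)); // nothing new
        System.out.println(hs.poll(8)); // new request, even if the poll arrived seconds late
        System.out.println(hs.poll(8)); // already handled
    }
}
```

Because the comparison is against the last handled value rather than the previous sample, a late poll still sees the request.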

rbachman · November 11, 2021, 1:44pm

We do exactly that in certain cases at some of our other facilities, but it would likely be challenging in this specific case. This is a very rare issue that just popped up recently, and I think it was more of a CPU bottleneck than anything. Since increasing our server from 4 cores to 12, this problem has disappeared. At this point, we’re trying to make incremental improvements to reduce the risk even further, especially now that we know that Clock Drift errors appear to be tied directly to Java GC.

Kevin.Herron · November 11, 2021, 1:48pm

This is one common cause. Perhaps the most common. If your gateway's clock shifts dramatically you may also notice a message. Virtual machines always seem to be more sensitive to all causes.
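To illustrate the general idea of detecting this kind of stall (this is a sketch of a common technique, not Ignition's actual implementation): a thread sleeps for a fixed interval and checks how late it woke up. The interval, threshold, and class name are assumptions chosen for the example.

```java
// Hypothetical stall detector: measures how much a fixed sleep overshoots.
final class PauseDetector {
    /** Milliseconds by which the elapsed time exceeded the requested sleep (never negative). */
    static long overshootMillis(long requestedMs, long startNanos, long endNanos) {
        long elapsedMs = (endNanos - startNanos) / 1_000_000;
        return Math.max(0, elapsedMs - requestedMs);
    }

    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 1_000; // assumed check interval
        final long warnMs = 500;       // assumed warning threshold
        for (int i = 0; i < 3; i++) {
            long start = System.nanoTime();
            Thread.sleep(intervalMs);
            long over = overshootMillis(intervalMs, start, System.nanoTime());
            if (over > warnMs) {
                System.out.println("Stall detected: woke up ~" + over + " ms late");
            }
        }
    }
}
```

System.nanoTime() is monotonic, so this only catches stalls such as GC pauses or a starved VM; catching a jump in the wall clock would need a comparison against System.currentTimeMillis() as well.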

rbachman · November 11, 2021, 1:51pm

We actually had an issue when we tried to implement -XX:MaxGCPauseMillis=100 last weekend. Another person on my team put it on our backup server, and he couldn’t get Ignition to start after that. I believe the exact line he added was this:

wrapper.java.additional.2=-XX:MaxGCPauseMillis=250

Forgive our ignorance; is that the correct format? We’re running Ignition 8.1.7 at that facility.

Kevin.Herron · November 11, 2021, 1:53pm

Hmm, that looks right, but without seeing the other parameters I can only assume the 2 is correct…

Also, 200 is the default, so you’re actually making it worse by using 250.

edit: maybe a copy/paste from here or email ended up with something silly like an em-dash instead of a plain dash in the file?
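One quick way to rule out the pasted-dash theory is to scan the line for any non-ASCII character. A minimal sketch follows; the class name is made up, and the "bad" line is constructed here for illustration with \u2014 standing in for a pasted em-dash.

```java
// Hypothetical check for non-ASCII characters in a config line.
final class ConfLineCheck {
    /** Index of the first non-ASCII character in the line, or -1 if the line is plain ASCII. */
    static int firstNonAscii(String line) {
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) > 0x7F) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String good = "wrapper.java.additional.2=-XX:MaxGCPauseMillis=250";
        String bad = "wrapper.java.additional.2=\u2014XX:MaxGCPauseMillis=250"; // em-dash instead of '-'
        System.out.println("good: " + firstNonAscii(good));
        System.out.println("bad:  " + firstNonAscii(bad));
    }
}
```

A result other than -1 points at the character position to retype by hand.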

bmeyers · March 15, 2022, 5:27pm

-XX:MaxGCPauseMillis=100

What is this line of code doing specifically?
Can we add it anywhere in the ignition.conf file?
Do I need to restart the Ignition Gateway after adding the line?

Kevin.Herron · March 15, 2022, 5:29pm

Sets a target value for the maximum pause time in the JVM's GC. Best effort.

No, you have to add it to the additional params section as above.

Yes.
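After a restart, it can be worth confirming that the flag actually reached the JVM. The sketch below uses the standard RuntimeMXBean to list the JVM's input arguments and look for the flag. The class name is hypothetical, and run standalone it inspects its own JVM, so checking the gateway this way assumes you can run it inside, or attach to, the gateway process.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: looks for a flag among the JVM's startup arguments.
final class JvmFlagCheck {
    /** Returns the first JVM argument that starts with the given prefix, if any. */
    static Optional<String> findFlag(List<String> jvmArgs, String prefix) {
        return jvmArgs.stream().filter(a -> a.startsWith(prefix)).findFirst();
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println(findFlag(jvmArgs, "-XX:MaxGCPauseMillis=")
                .orElse("MaxGCPauseMillis not passed on the command line; the JVM default applies"));
    }
}
```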

bmeyers · March 15, 2022, 5:39pm

Would this be correct?

Kevin.Herron · March 15, 2022, 5:41pm

No, you’d want something like:

wrapper.java.additional.3=-XX:MaxGCPauseMillis=100

on that line.

And just for your sake and anybody else coming along later: this change isn’t free and should only be applied to “healthy” systems. The cost of lower pause times is more frequent (but probably smaller) GC events and some amount of higher CPU usage.
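To see that trade-off on a given system, one option is to compare GC counts and accumulated GC time before and after the change. The sketch below reads both from the standard GarbageCollectorMXBean and computes an average time per collection between two snapshots. The class and method names are hypothetical, and getCollectionTime() is an approximate accumulated figure, so treat the result as a rough indicator rather than an exact pause length.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: summarizes GC activity of the JVM it runs in.
final class GcStats {
    /** Average milliseconds per collection between two snapshots; 0 if no collections happened. */
    static double avgCollectionMs(long count0, long timeMs0, long count1, long timeMs1) {
        long collections = count1 - count0;
        return collections <= 0 ? 0.0 : (double) (timeMs1 - timeMs0) / collections;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", totalMs=" + gc.getCollectionTime());
        }
    }
}
```

A lower pause target would be expected to show up as a higher count over the same period with a smaller average per collection.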