ClockDriftDetector

Curlyandshemp · June 25, 2018, 12:02am

What exactly is going on to create this?

I am seeing this on quite a few Ignition installs that have only 1 to 3 OPC-UA PLC connections, reading fewer than 500 tags total, with maybe only 250 - 500 SQL tags logged to a database. Most of these errors occur on a customer-supplied VM on their corporate network.

Is there some VM resource parameter I need to ask them to increase?

pturmel · June 25, 2018, 12:16am

This could be simple VM overcommit, but it’s more likely to be garbage collection under the classic Concurrent-Mark-Sweep algorithm. You really, really need to use G1GC. And then monitor your gateway’s memory utilization under Status=>Performance. You should see a regular sawtooth over a few to several minutes from ~50% to ~90%. Or less. If it won’t drop to below 50% at regular intervals, bump up the memory settings in ignition.conf.

Curlyandshemp · June 25, 2018, 1:53am

Maybe I am a simpleton, but I don’t understand how your answer explains what is going on at a customer’s install.

pturmel · June 25, 2018, 2:15am

“Pause-the-world” is what happens when Java runs out of heap memory and must clear out discarded objects to complete a requested operation. It also happens when Java is pre-emptively clearing memory and must handle unavoidable memory relocations. All Java threads stop during a GC pause. The classic algorithms have pathological pauses in some circumstances. The G1GC algorithm has dramatically superior performance and can avoid practically all long pauses.

As for your customer, configuring GC for performance, and adding the logging options, will either correct their problem or rule out GC/memory allowance as a cause. Until you rule out GC pauses, you won’t have consistent evidence for VM or OS issues.
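For a feel of what the JVM itself reports about heap use and collections, here is a minimal sketch using the standard java.lang.management beans. The class and method names (GcSnapshot, heapUsedPercent, totalGcTimeMillis) are made up for illustration. It only reports on the JVM it runs in, so it is not a substitute for the gateway's Status page or for GC logging, and the collection time is a cumulative figure that the JVM approximates, not a list of individual pauses.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

class GcSnapshot {
    /** Heap in use as a percentage of the max heap (or of committed heap if max is undefined). */
    static double heapUsedPercent() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax();                    // -1 when the max is undefined
        long denom = max > 0 ? max : heap.getCommitted();
        return 100.0 * heap.getUsed() / denom;
    }

    /** Approximate cumulative time spent in collections, summed over all collectors. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();         // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.printf("heap used: %.1f%%%n", heapUsedPercent());
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + " time=" + gc.getCollectionTime() + "ms");
        }
        System.out.println("total GC time: " + totalGcTimeMillis() + "ms");
    }
}
```

Sampled periodically, the heap percentage is what would trace out the sawtooth, and a jump in total GC time between two samples would line up with a long pause.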

Kevin.Herron · June 25, 2018, 3:06am

It’s hard to see in the screenshots, but if you look closely, those numbers are flip-flopping between negative and positive.

This probably means it’s not memory or performance related and the clock is actually being set forward and back ~47s by the VM host.
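To illustrate why the sign matters, here is a rough sketch of a drift check. It assumes the check sleeps for a fixed interval and compares how much wall-clock time went by against the interval it asked for. This is not Ignition's actual ClockDriftDetector; the class name, method names, and the 1000 ms interval and threshold are all assumptions for the example.

```java
class ClockDriftSketch {
    /** Wall-clock time that actually elapsed, minus the time we expected to elapse. */
    static long deviationMillis(long expectedIntervalMs, long wallBeforeMs, long wallAfterMs) {
        return (wallAfterMs - wallBeforeMs) - expectedIntervalMs;
    }

    /** Classify a deviation against a symmetric threshold. */
    static String classify(long deviationMs, long maxAllowedMs) {
        if (deviationMs < -maxAllowedMs) {
            // A pause can only make us late, never early, so the clock itself moved back.
            return "CLOCK_STEPPED_BACK";
        }
        if (deviationMs > maxAllowedMs) {
            // Late: a long pause or starved CPU, or the clock was stepped forward.
            return "PAUSE_OR_CLOCK_STEPPED_FORWARD";
        }
        return "OK";
    }

    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 1000;    // assumed check interval
        final long maxAllowedMs = 1000;  // assumed allowed deviation
        for (int i = 0; i < 3; i++) {
            long before = System.currentTimeMillis();
            Thread.sleep(intervalMs);
            long after = System.currentTimeMillis();
            long dev = deviationMillis(intervalMs, before, after);
            System.out.println(classify(dev, maxAllowedMs) + " deviation=" + dev + "ms");
        }
    }
}
```

A GC pause or an overcommitted host can only delay the wake-up, which gives a positive deviation. A large negative deviation means less wall-clock time passed than the sleep itself took, which points at the clock being set backward.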

pturmel · June 25, 2018, 10:52am

Oooh! Missed that. That is quite possibly a fight between a hypervisor's clock and an external NTP server. Guests should not run NTP themselves in such architectures.

Sanderd17 · June 26, 2018, 12:26pm

Yep, had the same issue a few months ago. The Ubuntu install had connected itself to an NTP server, and VMware was automatically setting the clock from its base system. So every few minutes, they were fighting and SQL queries were timing out.