Clockdriftdetector

What exactly is going on to create this?

I am seeing this on quite a few of my Ignition installs, even ones with only 1 to 3 OPC-UA PLC connections reading fewer than 500 tags in total, and maybe only 250 to 500 SQL tags logged to a database. Most of these errors occur on a customer-supplied VM on their corporate network.

Is there some VM resource parameter I need to ask them to increase?

This could be simple VM overcommit, but it's more likely to be garbage collection under the classic Concurrent Mark Sweep (CMS) algorithm. You really, really need to use G1GC. Then monitor your gateway's memory utilization under Status => Performance. You should see a regular sawtooth, repeating every few to several minutes, from ~50% up to ~90% (or less). If it won't drop below 50% at regular intervals, bump up the memory settings (wrapper.java.maxmemory) in ignition.conf.
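
As a rough illustration of the "drops below 50% at regular intervals" rule of thumb, here is a minimal Python sketch. It assumes you have heap-utilization percentages sampled at a fixed rate (for example read off Status => Performance); the function name, the `window` parameter, and the sample data are hypothetical. Choose `window` so that it spans a few minutes at your sampling rate.

```python
from typing import Sequence

def drops_below_regularly(samples: Sequence[float], window: int, floor: float = 50.0) -> bool:
    """Return True if every full window of `window` consecutive heap-utilization
    samples (in percent) contains at least one reading below `floor`.

    A trailing partial window is ignored, because it may simply not have
    reached its next collection yet.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    full = len(samples) - len(samples) % window
    if full == 0:
        return False  # not enough data to judge either way
    return all(min(samples[i:i + window]) < floor for i in range(0, full, window))

# Hypothetical one-sample-per-minute readings
healthy = [55, 70, 85, 45, 60, 75, 90, 48, 62, 80]   # sawtooth dips below 50%
starved = [70, 80, 90, 65, 75, 88, 92, 70]           # never recovers below 50%
print(drops_below_regularly(healthy, window=4))  # True
print(drops_below_regularly(starved, window=4))  # False
```

A `False` here corresponds to the case where the advice is to raise the heap allowance.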

Maybe I am a simpleton, but I don't understand how your answer explains what is going on at a customer's install.

"Pause-the-world" is what happens when Java runs out of heap memory and must clear out discarded objects before it can complete a requested operation. It also happens when Java is pre-emptively clearing memory and must handle unavoidable memory relocations. All Java threads stop during a GC pause, and the classic algorithms produce pathologically long pauses in some circumstances. The G1GC algorithm performs dramatically better and can avoid practically all long pauses.
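
The general idea behind detecting a pause can be sketched in a few lines of Python: a thread sleeps for a fixed interval and measures how late it woke up. This is a hypothetical illustration of the technique, not Ignition's actual ClockDriftDetector code. It uses the monotonic clock, which is not affected by system clock updates, so any lateness it reports comes from the thread being held up rather than from the clock being reset.

```python
import time
from typing import List

def measure_oversleep(interval_s: float) -> float:
    """Sleep for interval_s, then return how many extra seconds actually
    elapsed according to the monotonic clock."""
    start = time.monotonic()
    time.sleep(interval_s)
    return (time.monotonic() - start) - interval_s

def watch_for_pauses(interval_s: float, threshold_s: float, iterations: int) -> List[float]:
    """Run `iterations` sleep cycles and return the oversleep of every cycle
    that woke up more than threshold_s late."""
    late = []
    for _ in range(iterations):
        over = measure_oversleep(interval_s)
        if over > threshold_s:
            late.append(over)
    return late

# e.g. watch for ten minutes, flagging wake-ups more than 0.5 s late
# print(watch_for_pauses(1.0, 0.5, 600))
```

Because a watcher like this runs outside the JVM, Ignition's GC pauses do not affect it; if it also reports late wake-ups on the same VM, that points toward the VM or OS rather than GC.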

As for your customer, configuring GC for performance and adding the GC logging options will either correct their problem or rule out GC/memory allowance as a cause. Until you rule out GC pauses, you won't have consistent evidence for VM or OS issues.
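
Once GC logging is on, the log can be scanned for long pauses and their timestamps compared with the drift warnings. The sketch below assumes a Java 9+ runtime with unified logging enabled through something like `-Xlog:gc:file=gc.log` (typically added as a `wrapper.java.additional.N` line in ignition.conf, using the next unused N). Older Java 8 gateways use different logging flags and a different line format that this regex will not match. The function name and threshold are hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Matches G1 pause summary lines from JDK 9+ unified logging (-Xlog:gc), e.g.
# [12.345s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 24M->3M(256M) 4.123ms
PAUSE_RE = re.compile(r"GC\((\d+)\) (Pause .*?) \S+->\S+ ([\d.]+)ms")

def long_pauses(lines: Iterable[str], threshold_ms: float) -> List[Tuple[int, str, float]]:
    """Return (gc_id, pause_kind, duration_ms) for every pause >= threshold_ms."""
    found = []
    for line in lines:
        m = PAUSE_RE.search(line)
        if m and float(m.group(3)) >= threshold_ms:
            found.append((int(m.group(1)), m.group(2), float(m.group(3))))
    return found

sample = [
    "[12.345s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 24M->3M(256M) 4.123ms",
    "[98.100s][info][gc] GC(8) Pause Full (G1 Evacuation Pause) 250M->100M(256M) 1850.500ms",
]
print(long_pauses(sample, 1000))  # [(8, 'Pause Full (G1 Evacuation Pause)', 1850.5)]
```

If no logged pause comes close to the size of the reported deviations, GC is effectively ruled out.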

It's hard to see in the screenshots, but if you look closely, those numbers are flip-flopping between negative and positive.

This probably means it’s not memory or performance related and the clock is actually being set forward and back ~47s by the VM host.
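
The reason the sign matters: a pause delays the wall clock and the monotonic clock equally, so comparing the two shows nothing. A clock that is *set* moves the wall clock alone, in either direction. A minimal Python sketch of that comparison follows; the function names and the 1 s cut-off are hypothetical, and it is not how Ignition itself measures drift.

```python
import time
from typing import List

def clock_step(wall_before: float, mono_before: float,
               wall_after: float, mono_after: float) -> float:
    """Seconds the wall clock moved beyond (positive) or short of (negative)
    the real elapsed time measured on the monotonic clock."""
    return (wall_after - wall_before) - (mono_after - mono_before)

def sample_steps(interval_s: float, count: int) -> List[float]:
    """Record the wall-clock step observed across each of `count` intervals."""
    steps = []
    wall, mono = time.time(), time.monotonic()
    for _ in range(count):
        time.sleep(interval_s)
        new_wall, new_mono = time.time(), time.monotonic()
        steps.append(clock_step(wall, mono, new_wall, new_mono))
        wall, mono = new_wall, new_mono
    return steps

def flip_flops(steps: List[float], min_step_s: float) -> bool:
    """True if significant steps occur in both directions, as when two time
    sources keep correcting the clock against each other."""
    big = [s for s in steps if abs(s) >= min_step_s]
    return any(s > 0 for s in big) and any(s < 0 for s in big)

# Wall clock advanced 57 s while only 10 s really elapsed: a +47 s step
print(clock_step(100.0, 5.0, 157.0, 15.0))  # 47.0
print(flip_flops([47.0, -47.0, 0.001], 1.0))  # True
```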

Oooh! Missed that. That is quite possibly a fight between a hypervisor's clock and an external NTP server. Guests should not run NTP themselves in such architectures.

Yep, had the same issue a few months ago. Ubuntu was installed with NTP syncing to an external server, and VMware was also automatically setting the guest clock from its host. So every few minutes they fought over the clock, and SQL queries were timing out.