G1GC Issue? Thread Viewer causes Clock Drift

Hello - we are running Ignition 7.9.9, with 32GB of memory dedicated to Java on a 128GB RAM virtual machine (4-core Xeon Gold 5118 @ 2.3 GHz) running Windows Server 2012 R2. We have encountered significant issues with clock drift and application hangs over the last few weeks, with no obvious culprit. We can consistently cause a drift/hang by opening the thread viewer on the gateway page. Has anybody seen behavior like this, and more importantly, does anyone know how to resolve it? Our garbage collection settings are below (and yes, we have logging enabled on purpose):

wrapper.java.additional.1=-XX:+UseG1GC
#wrapper.java.additional.2=-XX:MaxGCPauseMillis=50
wrapper.java.additional.3=-XX:+PrintGCDetails
wrapper.java.additional.4=-XX:+PrintGCApplicationStoppedTime

wrapper.java.additional.5=-Ddata.dir=data
wrapper.java.additional.6=-Dorg.apache.catalina.loader.WebappClassLoader.ENABLE_CLEAR_REFERENCES=false

wrapper.java.additional.10=-Dignition.allowunsignedmodules=true
wrapper.java.additional.11=-Dia.developer.moduleupload=true

# Initial Java Heap Size (in MB)

wrapper.java.initmemory=2048

# Maximum Java Heap Size (in MB)

wrapper.java.maxmemory=32000

Thanks!

One thing you will probably want to do is lower your maximum Java heap size. Look at your Java memory usage history; it should look like a sawtooth pattern. Take the peak of that sawtooth, add about 30%, and use that value as your maximum heap size. I’m not sure that will solve your problem, but it will help make G1GC more efficient.


It seems that 32GB is not far off from that recommendation then.

Yeah, you’re right, you could probably get away with 27GB max, but that probably isn’t the key to your problem.

Yeah… I appreciate the thought though! There’s just a lot of activity on this server.

Further data points around G1GC: it looks like there was a normal minor GC event, but at the end of it the collector expanded the Eden space from 8MB to 17GB. In the collection immediately after, we ran into a 17-second object copy.

Log Details:

INFO | jvm 4 | 2019/09/26 13:43:38 | [GC pause (G1 Evacuation Pause) (young), 0.0458696 secs]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Parallel Time: 23.3 ms, GC Workers: 13]
INFO | jvm 4 | 2019/09/26 13:43:38 | [GC Worker Start (ms): Min: 642559943.4, Avg: 642559943.5, Max: 642559943.7, Diff: 0.3]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Ext Root Scanning (ms): Min: 1.7, Avg: 2.0, Max: 3.6, Diff: 1.9, Sum: 26.3]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Update RS (ms): Min: 10.1, Avg: 11.8, Max: 12.0, Diff: 1.9, Sum: 153.0]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Processed Buffers: Min: 18, Avg: 30.3, Max: 54, Diff: 36, Sum: 394]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.6]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Object Copy (ms): Min: 8.9, Avg: 9.0, Max: 9.2, Diff: 0.2, Sum: 117.5]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Termination (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.8]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Termination Attempts: Min: 1, Avg: 65.4, Max: 85, Diff: 84, Sum: 850]
INFO | jvm 4 | 2019/09/26 13:43:38 | [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.1, Sum: 0.9]
INFO | jvm 4 | 2019/09/26 13:43:38 | [GC Worker Total (ms): Min: 22.8, Avg: 23.0, Max: 23.1, Diff: 0.3, Sum: 299.2]
INFO | jvm 4 | 2019/09/26 13:43:38 | [GC Worker End (ms): Min: 642559966.5, Avg: 642559966.5, Max: 642559966.6, Diff: 0.1]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Code Root Fixup: 0.0 ms]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Code Root Purge: 0.0 ms]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Clear CT: 0.4 ms]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Other: 22.2 ms]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Choose CSet: 0.0 ms]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Ref Proc: 19.9 ms]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Ref Enq: 0.6 ms]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Redirty Cards: 0.3 ms]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Humongous Register: 0.1 ms]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Humongous Reclaim: 0.1 ms]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Free CSet: 0.6 ms]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Eden: 1704.0M(1704.0M)->0.0B(17.1G) Survivors: 184.0M->72.0M Heap: 12.6G(31.3G)->10.8G(31.3G)]
INFO | jvm 4 | 2019/09/26 13:43:38 | [Times: user=0.42 sys=0.00, real=0.05 secs]
INFO | jvm 4 | 2019/09/26 13:43:38 | Total time for which application threads were stopped: 0.0483955 seconds, Stopping threads took: 0.0001874 seconds
INFO | jvm 4 | 2019/09/26 13:43:46 | Total time for which application threads were stopped: 0.0030136 seconds, Stopping threads took: 0.0002035 seconds
INFO | jvm 4 | 2019/09/26 13:43:46 | Total time for which application threads were stopped: 0.0022161 seconds, Stopping threads took: 0.0001000 seconds
INFO | jvm 4 | 2019/09/26 13:43:46 | Total time for which application threads were stopped: 0.0020475 seconds, Stopping threads took: 0.0000656 seconds
INFO | jvm 4 | 2019/09/26 13:43:46 | Total time for which application threads were stopped: 0.0019047 seconds, Stopping threads took: 0.0000951 seconds
INFO | jvm 4 | 2019/09/26 13:43:46 | Total time for which application threads were stopped: 0.0017944 seconds, Stopping threads took: 0.0000910 seconds
INFO | jvm 4 | 2019/09/26 13:43:46 | Total time for which application threads were stopped: 0.0014950 seconds, Stopping threads took: 0.0000852 seconds
INFO | jvm 4 | 2019/09/26 13:43:46 | Total time for which application threads were stopped: 0.0013441 seconds, Stopping threads took: 0.0000794 seconds
INFO | jvm 4 | 2019/09/26 13:43:46 | Total time for which application threads were stopped: 0.0013973 seconds, Stopping threads took: 0.0001629 seconds
INFO | jvm 4 | 2019/09/26 13:43:50 | Total time for which application threads were stopped: 0.0022608 seconds, Stopping threads took: 0.0002379 seconds
INFO | jvm 4 | 2019/09/26 13:43:56 | I [c.i.i.g.s.g.f.Projects$ProjectChangeMonitor] [11:43:56]: Starting up client project monitor. project=XXXXXXXXXX, uuid=d550e43b-2a35-2a5c-059f-1d1eb70a9975, editCount=3377, scope=4, version=Published request-origin=10.52.212.48, session-user=OP, session-project=XXXXXXXXXX, session-id=8C5248F4
INFO | jvm 4 | 2019/09/26 13:43:56 | Total time for which application threads were stopped: 0.0028659 seconds, Stopping threads took: 0.0001919 seconds
INFO | jvm 4 | 2019/09/26 13:43:56 | Total time for which application threads were stopped: 0.0019966 seconds, Stopping threads took: 0.0001740 seconds
INFO | jvm 4 | 2019/09/26 13:43:56 | Total time for which application threads were stopped: 0.0017132 seconds, Stopping threads took: 0.0001392 seconds
INFO | jvm 4 | 2019/09/26 13:43:56 | Total time for which application threads were stopped: 0.0023148 seconds, Stopping threads took: 0.0002129 seconds
INFO | jvm 4 | 2019/09/26 13:43:56 | Total time for which application threads were stopped: 0.0023299 seconds, Stopping threads took: 0.0002030 seconds
INFO | jvm 4 | 2019/09/26 13:43:56 | Total time for which application threads were stopped: 0.0015708 seconds, Stopping threads took: 0.0001941 seconds
INFO | jvm 4 | 2019/09/26 13:43:56 | Total time for which application threads were stopped: 0.0015231 seconds, Stopping threads took: 0.0001709 seconds
INFO | jvm 4 | 2019/09/26 13:43:57 | I [T.ClientInterface ] [11:43:56]: Removing Tags due to switching user. request-origin=10.52.212.48, session-user=OP, session-project=XXXXXXXXXX, session-id=8C5248F4
INFO | jvm 4 | 2019/09/26 13:43:59 | Total time for which application threads were stopped: 0.0028913 seconds, Stopping threads took: 0.0002459 seconds
INFO | jvm 4 | 2019/09/26 13:44:00 | Total time for which application threads were stopped: 0.0026298 seconds, Stopping threads took: 0.0002736 seconds
INFO | jvm 4 | 2019/09/26 13:44:00 | Total time for which application threads were stopped: 0.0032073 seconds, Stopping threads took: 0.0003124 seconds
INFO | jvm 4 | 2019/09/26 13:44:02 | Total time for which application threads were stopped: 0.0036977 seconds, Stopping threads took: 0.0002486 seconds
INFO | jvm 4 | 2019/09/26 13:44:06 | [GC pause (G1 Evacuation Pause) (young)
INFO | jvm 4 | 2019/09/26 13:44:23 | , 17.5622303 secs]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Parallel Time: 17498.2 ms, GC Workers: 13]
INFO | jvm 4 | 2019/09/26 13:44:23 | [GC Worker Start (ms): Min: 642586730.4, Avg: 642586730.5, Max: 642586730.6, Diff: 0.3]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Ext Root Scanning (ms): Min: 2.0, Avg: 2.5, Max: 5.6, Diff: 3.6, Sum: 31.9]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Update RS (ms): Min: 14.4, Avg: 17.5, Max: 18.1, Diff: 3.7, Sum: 227.7]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Processed Buffers: Min: 25, Avg: 39.3, Max: 58, Diff: 33, Sum: 511]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Scan RS (ms): Min: 0.2, Avg: 0.5, Max: 0.7, Diff: 0.5, Sum: 6.7]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Object Copy (ms): Min: 17092.4, Avg: 17122.5, Max: 17157.8, Diff: 65.4, Sum: 222592.0]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Termination (ms): Min: 319.4, Avg: 354.7, Max: 384.7, Diff: 65.3, Sum: 4611.3]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Termination Attempts: Min: 366042, Avg: 401317.8, Max: 429641, Diff: 63599, Sum: 5217131]
INFO | jvm 4 | 2019/09/26 13:44:23 | [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 1.9]
INFO | jvm 4 | 2019/09/26 13:44:23 | [GC Worker Total (ms): Min: 17497.7, Avg: 17497.8, Max: 17498.0, Diff: 0.4, Sum: 227471.8]
INFO | jvm 4 | 2019/09/26 13:44:23 | [GC Worker End (ms): Min: 642604228.2, Avg: 642604228.3, Max: 642604228.4, Diff: 0.2]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Code Root Fixup: 0.0 ms]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Code Root Purge: 0.0 ms]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Clear CT: 2.1 ms]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Other: 61.9 ms]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Choose CSet: 0.0 ms]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Ref Proc: 53.9 ms]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Ref Enq: 1.9 ms]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Redirty Cards: 0.7 ms]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Humongous Register: 0.2 ms]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Humongous Reclaim: 0.1 ms]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Free CSet: 4.0 ms]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Eden: 17.1G(17.1G)->0.0B(8192.0K) Survivors: 72.0M->1688.0M Heap: 28.0G(31.3G)->12.4G(31.3G)]
INFO | jvm 4 | 2019/09/26 13:44:23 | [Times: user=227.24 sys=0.05, real=17.56 secs]

Is there a particular reason you commented out the desired MaxGCPauseMillis in ignition.conf? I know it is not a guarantee, but G1GC will perform more frequent, smaller GCs to try to reach that target. 50ms might be unrealistic on a 32GB heap, but a setting of 100ms or 250ms might have pushed it to avoid the 17-second pause.

We were originally running with MaxGCPauseMillis=50 in the config, but we suspected that 50ms was indeed unrealistic. Do you think a cap of 250ms would work better than running uncapped?
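
For reference, this is roughly what re-enabling the pause target in ignition.conf would look like (the 250ms figure is just the value discussed above, not a tested recommendation; as I understand it, G1 falls back to a default target of 200ms when the flag is absent, so leaving it out isn’t truly uncapped):

wrapper.java.additional.2=-XX:MaxGCPauseMillis=250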

I’ve seen recommendations on various blogs and sites that we could set -Xms equal to -Xmx to keep a consistent heap size, and then either use NewSize + MaxNewSize to control the young generation, or allow a bit more flexibility through the NewRatio + SurvivorRatio flags (rough sketch below). Have you had success with either of these approaches using Ignition?
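
For illustration, a sketch of what those options could look like in our wrapper config, using its existing syntax (the property indexes and the ratio values are placeholders, not recommendations, and my understanding is that explicitly fixing the young generation size can override G1’s pause-time goal):

wrapper.java.initmemory=32000
wrapper.java.maxmemory=32000
wrapper.java.additional.12=-XX:NewRatio=2
wrapper.java.additional.13=-XX:SurvivorRatio=8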

A further thought - would our version of JRE affect this? We are running 1.8.0_171-b11.

Yes, I think a 250ms cap would be better than running uncapped.

I have not played with -Xms or the New/Survivor Ratio flags. Setting -Xms and -Xmx identical does make sense if you know your heap needs. It seems you do.

I would think it unlikely that the JRE version is a factor. G1GC was already pretty mature back in Java 7.

Thanks Phil. I will post an update on what we find during testing over the next couple of days.

Hi Tim, last year we had a long-running issue with clock drift and very high memory usage impacting the gateway. It took us a really long time to resolve, but I found the Java Flight Recorder to be a useful troubleshooting tool. Initially I reduced the symptoms by allocating more memory to Ignition and adding more DB connections. I also made sure that the CPU and memory were reserved in the VM.
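
If you want to try the flight recorder, this is roughly how it can be enabled for a gateway running on Oracle JDK 8 (the property indexes, the pid, and the file name below are just examples; I believe the commercial-features unlock is required on Oracle 8):

wrapper.java.additional.20=-XX:+UnlockCommercialFeatures
wrapper.java.additional.21=-XX:+FlightRecorder

After a gateway restart, a recording can then be started against the running JVM with jcmd:

jcmd <gateway-pid> JFR.start duration=120s filename=ignition-gc.jfr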

The root cause of our issue was a slow query asking for a huge amount of data. It created a CPU spike, followed by memory growth that the GC never cleaned up.

Thanks for the ideas. I will check out the Java Flight Recorder; we’ve been in touch with support and have been using jmap. We’re still working through this issue, but your root cause sounds exactly like the path we are headed down: we found that the historian’s weekly partition contains 16 million records and was taking over 4 minutes to return.

The flight recorder initiated a ‘manual’ GC event, and our memory usage instantly dropped from 20GB to 3GB, which is normal for us. This highlighted that our expensive queries were only triggered once after gateway restarts, and eventually we tracked down some very old MES reports that were only accessed occasionally. When the window was first opened, the query would use the cached date and try to return 3 years’ worth of data rather than the last 24 hours. I just deleted the window and spent a bit of time looking for similar queries in my projects. I never worked out why the GC wouldn’t release the memory I thought it was meant to.

The GC only releases memory when it feels a certain pressure to do so (at least that's the default configuration). If it already reached 20GB max memory, it will feel no pressure to do big cleanups.

There are options to force cleanups, but it's better to just find and fix your issue and let the default handling do its job (as you did).
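
If you do need to force a collection for diagnostic purposes, a full GC can be triggered from the command line (the pid is a placeholder; this is effectively the same as calling System.gc()):

jcmd <gateway-pid> GC.run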

Thanks everyone for the replies. We’ve had no success at troubleshooting this from a GC perspective. We are now looking into the Historian side of things… hopefully we’ll find something there!

I noticed on 7.9.13 and 8.1.0 that clock drift warnings occur every time the gateway Threads page is viewed. On the 8.1.0 system it occurred after a fresh install of Ignition with only 3 DB connections created. There were no projects at all on the system.

I’m using 8.0.13 (for a project that was supposed to go live 6 months ago :roll_eyes:) and I’m able to replicate this very same issue. It seems to cause a cascade of clock drift warnings as long as the page is open. Did you ever find a resolution that you could share?

I did not find a resolution.

Getting thread details is a very disruptive action within the JVM; it isn’t an Ignition bug. The same thing will happen if you use JDK debugging tools to capture a thread dump or other debugging snapshots.
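
For example, capturing a thread dump with the standard JDK tools forces the same kind of global safepoint, so you would see a similar pause (the pid is a placeholder):

jstack -l <gateway-pid>
jcmd <gateway-pid> Thread.print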
