Gateway Restarted, Why?

I noticed that all of our Andon connections dropped for a few minutes today. Reviewing the SCADA gateway, I found an initial OOM kill recorded in Kubernetes at 10:13. Digging into the Ignition logs, I saw that the gateway restarted itself at 10:12, prior to Kubernetes recording the OOM kill.

Do you have any idea what happened here?

Logs from gateway below and in screenshots.
jvm 42 | 2025/02/05 16:11:49 | W [o.a.w.p.PageAccessSynchronizer] [16:11:49]: Thread 'webserver-6085' failed to acquire lock to page with id '29', attempted for 1 minute out of allowed 1 minute. The thread that holds the lock has name 'webserver-5869'.
jvm 42 | 2025/02/05 16:11:49 | W [o.a.w.p.PageAccessSynchronizer] [16:11:49]: "webserver-5869" prio=5 tid=5869 state=BLOCKED
jvm 42 | 2025/02/05 16:11:49 | org.apache.wicket.util.lang.Threads$ThreadDump: null
jvm 42 | 2025/02/05 16:11:49 | E [o.a.w.DefaultExceptionMapper ] [16:11:49]: Unexpected error occurred
jvm 42 | 2025/02/05 16:11:49 |
org.apache.wicket.page.CouldNotLockPageException: Could not lock page 29. Attempt lasted 1 minute
wrapper | 2025/02/05 16:11:50 | TERM trapped. Shutting down.

Ooooo! Never seen anything like that. IA Support will want to know.

Taking a look at those logs, one thing that stands out is the jvm 42 prefix--this tells me the Java Service Wrapper has likely restarted Ignition within that pod numerous times before the external OOM kill was received. It is possible that the final error you're looking at there is a symptom of a more basic underlying resource issue.

You will need to adjust some of your resource [memory] limits (and underlying Ignition memory settings). It might also be useful to peruse the logs further back to see events surrounding those JVM restarts.

5 Likes

Upon further analysis, it looks like there was a true OOM kill earlier, and it hosed the GAN connections. That appears to have caused stability issues with the websockets trying to reconnect. Fully restarting the gateway seems to have knocked some sense into it.

We are expanding the resources on this gateway asap!

At this point, since you've seen an OOM kill (from K8s), increase the "distance" between your Ignition JVM heap allocation and the pod limits. Exactly what to do here will vary with the underlying application running in Ignition, but it is probably best to start conservative and leave a bit more ceiling.
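
To put rough numbers on that "distance" (hypothetical values only; the right split depends entirely on your workload):

# Hypothetical sizing sketch:
#   pod memory limit (Kubernetes):  16 GiB
#   Ignition max heap (ignition.conf, value in MB):
wrapper.java.maxmemory=12288
#   the remaining ~4 GiB of ceiling covers metaspace, thread stacks,
#   JNI allocations, and the OS, all of which must fit inside the pod limit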

1 Like

Whatever else you do, make sure the memory settings in ignition.conf are the same for initial and maximum. If they are different, and anything else is screwed up, java will blow up later, making troubleshooting difficult.
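
Something like this, for illustration (hypothetical values; ignition.conf takes these in MB):

# initial and maximum heap set to the same value
wrapper.java.initmemory=12288
wrapper.java.maxmemory=12288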

1 Like

... and if you're using our container image, you can drive these settings externally by supplying wrapper/JVM runtime arguments, docs here.

Consider using these to drive a JVM heap memory percentage based on your container's resource limits:

# These will instruct the wrapper to not use explicit values
wrapper.java.initmemory=0
wrapper.java.maxmemory=0
# These JVM args will set the initial and max allocations 
# to a percentage of available memory (i.e. your resource limits).
-XX:InitialRAMPercentage=75
-XX:MaxRAMPercentage=75

This way, if you increase your resource limits, Ignition's memory usage scales accordingly.
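
As a rough worked example (assuming, hypothetically, a 32 GiB pod memory limit):

# 32 GiB limit x 75% = 24 GiB initial/max heap
# ~8 GiB left over for metaspace, thread stacks, JNI, and the OS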

4 Likes

I'll review this in more detail tomorrow with the team. I know some JVM settings were adjusted on other gateways.

@kcollins1

We set the memory something like this atm... We run other Java workloads in a similar manner and they experience little to no JVM kills; we add buffer between the JVM's max memory and the container's max memory.

Frequently crashes from OOM kill:
Shop A Init Memory 8 GB, Max Memory 31 GB
Shop A Memory Requests 36 GB, Limits 36 GB
Shop A CPU Requests 8 vCPU, Limits 8 vCPU

Seems to generally run well:
Shop B Init Memory 28 GB, Max Memory 73 GB
Shop B Memory Requests 32 GB, Limits 86 GB
Shop B CPU Requests 8 vCPU, Limits 24 vCPU

Frequently crashes from OOM kill:
Shop C Init Memory 256 MB, Max Memory 32 GB
Shop C Memory Requests 36 GB, Limits 36 GB
Shop C CPU Requests 4 vCPU, Limits 8 vCPU

Duh. You are letting Java start with a small fraction of its allowed memory, which then has to compete with OS buffers when it (later) tries to allocate the rest.

Set your initial equal to max in production environments. Always.
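
For example, applying that to the Shop C numbers above (illustrative only; ignition.conf values in MB):

# Shop C sketch: keep the 36 GB pod requests/limits, raise the initial heap
wrapper.java.initmemory=32768
wrapper.java.maxmemory=32768
# 32 GB heap plus JVM/JNI overhead should still fit under the 36 GB limit,
# though a bit more ceiling wouldn't hurt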

This is a containerized version of Ignition, so normally with Java there has to be some buffer. Java will typically consume everything it is allowed, so the container needs headroom above the JVM.

I'm open to updating the limits, but the container's resources typically have to have some buffer.

Yes, the container allowance must be greater than the java heap allowance. But the java initial heap and maximum heap need to be the same. Really.

Not true. Its heap will stop at its configured maximum. Its total memory will be a little more than that due to JVM and JNI overhead. But it stops there. And it will always get there, because Java's garbage collector doesn't try very hard until the heap has grown to its max.