Thousands of TIMED_WAITING threads causing gateway crash

Hi all,

We're having issues with one frontend gateway (it's only used for visualisation, and it gets tag values from remote gateways).

For days we encounter no problems. Then at some point the JVM runs out of memory because the thread count exceeds the maximum task limit Linux allocates to a process.

In this instance the gateway is still running, but we can't access the gateway config page (only the left sidebar loads; clicking on "Config" or any other button produces a blank view in the middle). Failing over to the redundant gateway also takes ages, we assume because no thread can be created to initiate the failover...

We've managed to get a thread dump. This is what we see with Kindling:

80% of the Perspective threads are in the TIMED_WAITING state, and it seems only perspective-worker threads are responsible for this... They are using 0.00% CPU and they all have the same stack trace:

java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method)
java.base@17.0.9/java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
java.base@17.0.9/java.util.concurrent.SynchronousQueue$TransferStack.transfer(Unknown Source)
java.base@17.0.9/java.util.concurrent.SynchronousQueue.poll(Unknown Source)
java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
com.inductiveautomation.perspective.gateway.threading.BlockingWork$BlockingWorkRunnable.run(BlockingWork.java:58)
java.base@17.0.9/java.lang.Thread.run(Unknown Source)

We're not sure, but we don't think this is a memory leak, as the issue sometimes takes weeks to appear and sometimes occurs the day after a gateway reboot...

We're going to add additional properties to limit the pool size of the Perspective workers, but we don't think this will solve the issue.

We've opened a ticket with the French Ignition support, but we wanted to get some thoughts here first in case someone has already encountered this issue.
We might have to escalate the ticket to get to the American support...

I've seen someone else on the forum mention that their gateway goes crazy when they lose a particular PLC connection; we'll start historizing all device states and see if we can correlate something from that...

Thanks for any insights you might be able to provide.

When the Perspective worker threads are in TIMED_WAITING with that stack trace, they are harmless and will eventually get pruned. Their profuse existence is a side effect of your real problem: likely some event script that does a .sleep() or other blocking operation, triggered by a source that keeps firing and firing and firing. That yields hundreds or thousands of threads that become a thundering herd and crash your system.
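
For illustration only, a hypothetical version of that pattern might look like the property change script below; if the bound value flaps rapidly (a dropping PLC connection, say), every change ties up another worker thread for several seconds. Nothing here is taken from the original poster's project.

```python
# Hypothetical illustration of the anti-pattern described above (the component
# name and sleep duration are made up): a Perspective property change script
# that blocks its perspective-worker thread every time the bound value changes.
def valueChanged(self, previousValue, currentValue, origin, missedEvents):
    import time
    time.sleep(5)  # blocking work on a perspective-worker thread
    self.getSibling("StatusLabel").props.text = str(currentValue.value)
```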

You should add timing instrumentation to all of your event scripts and transforms, using Java's System.nanoTime(). This is easy if you've followed best practices and made all of your events and transforms delegate to project library script functions, as then you can add a decorator to do the timing.

Log any event script that takes longer than ten milliseconds to find your worst offenders. Fix anything over that threshold that runs in your UI, and any non-UI task over that threshold that doesn't have a dedicated thread.
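
As a sketch of that approach, a timing decorator in a Jython project library script could look something like this (the logger name, threshold constant, and function names are my own choices, not anything prescribed above):

```python
from java.lang import System

logger = system.util.getLogger("ScriptTiming")
THRESHOLD_NS = 10 * 1000 * 1000  # 10 ms, expressed in nanoseconds

def timed(func):
    # Log any decorated call that exceeds THRESHOLD_NS.
    def wrapper(*args, **kwargs):
        start = System.nanoTime()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = System.nanoTime() - start
            if elapsed > THRESHOLD_NS:
                logger.warn("%s took %.1f ms" % (func.__name__, elapsed / 1e6))
    return wrapper

# Usage: events and transforms call doWork() instead of containing logic themselves.
@timed
def doWork(value):
    # ... the real event/transform logic lives here ...
    return value
```

Because the events and transforms only delegate to decorated library functions, the instrumentation lives in one place and can be removed just as easily.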


This is not a solution to the issue at hand, but I'll also mention that, to temporarily improve system stability, there are some system properties you can set to tune Perspective's thread pooling strategy:

Edit: I see you mentioned this in your post, but I'll leave the link for posterity in case someone else comes across this thread in the future.
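
As a rough sketch of the mechanism only, such tuning goes into the Java Additional Parameters section of ignition.conf; the property name below is a placeholder, not the real one (the actual Perspective pool properties are in the link above):

```
# Sketch only: replace <perspective.pool.property> with the actual property
# name from the documentation linked above, and use an index that isn't
# already taken by another wrapper.java.additional entry.
wrapper.java.additional.5=-D<perspective.pool.property>=8000
```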


Thank you both as always @PGriffith & @pturmel

We'll monitor everything closely and go through our scripts to log any that take longer than "normal".

In the meantime, we've added the "pool-size" parameter to our Ignition configuration.

Just to make sure: we've currently set the pool size of the workers to 8000 and the pool size of the queues to 8000 as well.
I assume it is expected to constantly have ~16,000 threads (~99% of them in the WAITING state), correct? Since they no longer have a time limit, they don't get removed and are instead reused whenever necessary? Or is this unexpected behavior and I'm getting this wrong?