Hi all,
We're having issues with one frontend gateway (used only for visualisation; it gets its tag values from remote gateways).
For days we encounter no problems. Then at some point the JVM reports an out-of-memory error because the thread count exceeds the maximum task limit that Linux allocates to a process.
In this state the gateway is still running, but we can't access the gateway config page: only the left sidebar loads, and clicking "Config" or any other button produces a blank view in the middle. Failing over to the redundant gateway also takes ages; we assume this is because no thread can be created to initiate the failover.
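For anyone wanting to watch the same numbers, here is a minimal Python sketch that compares a process's live thread count with its per-process limit. It assumes a Linux host with /proc mounted; the function names are hypothetical, and the limit that actually bites may instead be a systemd TasksMax setting on the service, which this sketch does not read.

```python
import re
from pathlib import Path


def thread_count(status_text):
    """Return the 'Threads:' value from the text of /proc/<pid>/status."""
    match = re.search(r"^Threads:\s+(\d+)", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def max_processes_soft(limits_text):
    """Return the soft 'Max processes' limit from /proc/<pid>/limits.

    Returns None if the limit is 'unlimited' or the line is missing.
    """
    match = re.search(r"^Max processes\s+(\S+)", limits_text, re.MULTILINE)
    if not match or match.group(1) == "unlimited":
        return None
    return int(match.group(1))


def report(pid):
    """Read both files for the given PID and return (threads, soft_limit)."""
    proc = Path("/proc") / str(pid)
    threads = thread_count((proc / "status").read_text())
    limit = max_processes_soft((proc / "limits").read_text())
    return threads, limit
```

Calling `report(<gateway JVM pid>)` on a schedule and logging the result would show whether the thread count creeps up slowly or jumps suddenly before the failure.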
We've managed to get a thread dump. This is what we see with Kindling:
80% of the Perspective threads are in the "TIMED_WAITING" state, and it seems only perspective-worker threads are responsible for this. They use 0.00% CPU and all have the same stack trace:
java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method)
java.base@17.0.9/java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
java.base@17.0.9/java.util.concurrent.SynchronousQueue$TransferStack.transfer(Unknown Source)
java.base@17.0.9/java.util.concurrent.SynchronousQueue.poll(Unknown Source)
java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
com.inductiveautomation.perspective.gateway.threading.BlockingWork$BlockingWorkRunnable.run(BlockingWork.java:58)
java.base@17.0.9/java.lang.Thread.run(Unknown Source)
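To compare dumps taken at different times without Kindling, a rough tally by pool name and state can be sketched in Python. This assumes a jstack-style text dump (a quoted thread name on a header line, followed by a `java.lang.Thread.State:` line); a dump exported by the gateway itself may use a different format and would need a different parser. The function name `tally` is hypothetical.

```python
import re
from collections import Counter

# Header line of a jstack-style entry, e.g. "perspective-worker-12" #123 daemon ...
HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# State line that follows the header, e.g. java.lang.Thread.State: TIMED_WAITING (parking)
STATE = re.compile(r"java\.lang\.Thread\.State: (?P<state>[A-Z_]+)")


def tally(dump_text):
    """Count threads per (pool name, state) in a jstack-style text dump."""
    counts = Counter()
    current = None
    for line in dump_text.splitlines():
        header = HEADER.match(line)
        if header:
            # Strip a trailing "-<number>" so members of one pool group together.
            current = re.sub(r"-\d+$", "", header.group("name"))
            continue
        state = STATE.search(line)
        if state and current is not None:
            counts[(current, state.group("state"))] += 1
            current = None
    return counts
```

Running `tally(...)` on dumps taken a few hours apart and printing `counts.most_common(10)` would show which pool is growing and in which state.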
We're not sure, but we don't think this is a memory leak: sometimes the issue takes weeks to appear, and sometimes it occurs the day after a gateway reboot.
We're going to add additional properties to limit the pool size of the Perspective workers, but we don't think this will solve the issue.
We've opened a ticket with the French Ignition support, but we wanted to get some thoughts here first in case someone has already encountered the issue.
We might have to escalate this ticket to reach the American support.
I've seen someone else on the forum mention that their gateway goes crazy when they lose a particular PLC connection. We'll start historizing all device states to see whether we can find a correlation.
Thanks for any insights you might be able to provide.