Execution Manager stalling

We have a runnable registered via executionManager.register() that from time to time just stops processing.

Our run function does not throw any exceptions; we catch and log them all in the hope of letting it keep going.

What would cause an object on the shared ExecutionManager to stall and stop processing? Both of ours quit executing and we aren't seeing anything in the wrapper log.

Any help would be appreciated.

I can only think of 2 scenarios...

  1. your task threw an Error, not an Exception/Throwable, such as StackOverflowError or OutOfMemoryError
  2. all threads of the ExecutionManager's fixed size thread pool are blocked

A thread dump is a good place to start because it would at least show #2.
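If you want to spot-check for scenario #2 programmatically (in addition to taking a full gateway thread dump), the JVM's ThreadMXBean exposes the same state information. This is a generic sketch, not an Ignition API; the class and method names here are my own:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists threads currently blocked or waiting,
// i.e. the states you'd scan for in a thread dump to confirm that
// all of a fixed-size pool's workers are tied up.
public class BlockedThreadReport {
    public static List<String> blockedOrWaiting() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> out = new ArrayList<>();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            Thread.State s = info.getThreadState();
            if (s == Thread.State.BLOCKED
                    || s == Thread.State.WAITING
                    || s == Thread.State.TIMED_WAITING) {
                out.add(info.getThreadName() + " [" + s + "]");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        blockedOrWaiting().forEach(System.out::println);
    }
}
```

Filtering the output for the pool's thread-name prefix would show at a glance whether every worker is stuck.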

Error is a subclass of Throwable. Catch and log throwables in runnables, not exceptions.
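To make the distinction concrete, here's a small sketch showing that a StackOverflowError sails right past a catch (Exception) block and only lands in catch (Throwable):

```java
// Demonstrates scenario #1: Errors are Throwables but not Exceptions,
// so a catch (Exception) block never sees them.
public class CatchDemo {
    // Unbounded recursion to reliably trigger a StackOverflowError.
    static void blowStack(int n) { blowStack(n + 1); }

    /** Returns which catch block actually handled the failure. */
    public static String whoCatches() {
        try {
            blowStack(0);
            return "none";
        } catch (Exception e) {
            return "exception";   // never reached: StackOverflowError is not an Exception
        } catch (Throwable t) {
            return "throwable";   // Error is a Throwable, so it lands here
        }
    }

    public static void main(String[] args) {
        System.out.println(whoCatches()); // prints "throwable"
    }
}
```

A run function that only catches Exception will let an Error like this propagate out of run(), which is exactly how a "never throws" task can still die.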

Oops, yeah. And we already catch stray Throwables for tasks submitted to the ExecutionManager.

So, to be clear, the runnable should catch any Throwable, not just Exceptions, to prevent it from stopping in the execution manager?

It should, but it doesn't actually have to. The ExecutionManager wraps submitted tasks and catches stray Throwables, and logs an error to the gateway logs if that happens.
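The wrapping pattern described here looks roughly like the following. This is a sketch of the general idea, not Ignition's actual implementation; the names and the log sink are hypothetical:

```java
import java.util.function.Consumer;

// Sketch (not Ignition's code) of a manager wrapping each submitted task
// in its own try/catch (Throwable), so a stray Error is logged instead of
// escaping into the thread pool.
public class SafeWrapper {
    public static Runnable wrap(String taskName, Runnable task, Consumer<String> log) {
        return () -> {
            try {
                task.run();
            } catch (Throwable t) {
                // Loosely mirrors the "Task %s %s threw uncaught exception." message.
                log.accept("Task " + taskName + " threw uncaught exception: " + t);
            }
        };
    }

    public static void main(String[] args) {
        Runnable bad = () -> { throw new OutOfMemoryError("simulated"); };
        // The wrapper logs the Error rather than letting it propagate.
        wrap("demo", bad, System.out::println).run();
    }
}
```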

We aren't seeing the error in the log or maybe we aren't sure what error to look for. The runnable just stops processing.

Any idea what it might show up as in the wrapper log?

Something like "Task %s %s threw uncaught exception."

I'm not going to help you any more until you produce a thread dump...

I can produce a thread dump, but I have already got the system back up and running. Production was down. So is the thread dump still helpful?

Ignition-ICOB-MOCLIFT01_thread_dump20230817-111540.json (154.2 KB)

Yikes. You really shouldn't be doing module development against a production gateway.

Maybe not as useful, but I'll look anyway. It's better to capture one while the thread growth and other problems are being observed.

We weren't doing development against production. It was observed in production and we were trying to find a solution to test in our dev environment. The thing that is stumping us is that it took weeks for this to appear again and it doesn't appear at every site using this version of our module.

I grabbed the wrapper log and our system's log data and didn't find anything. It is good to know that we should grab a thread dump as well if this happens again.

Thank you for your help and let me know if you find anything.

This thread dump looks fairly innocuous, but keep an eye on the 12 threads from the ExecutionManager, named gateway-shared-exec-engine-N.

This dump has 3 blocked with stack traces like this:

  java.base@11.0.15/java.net.SocketInputStream.socketRead0(Native Method)
  java.base@11.0.15/java.net.SocketInputStream.socketRead(Unknown Source)
  java.base@11.0.15/java.net.SocketInputStream.read(Unknown Source)
  java.base@11.0.15/java.net.SocketInputStream.read(Unknown Source)
  java.base@11.0.15/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
  java.base@11.0.15/java.util.concurrent.FutureTask.runAndReset(Unknown Source)
  java.base@11.0.15/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
  java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  java.base@11.0.15/java.lang.Thread.run(Unknown Source)

If those are blocking indefinitely, or for a long time, and you end up with 12 blocked at once, all your other submitted tasks will be waiting in line and not executing until they unblock.
