Gateway timers stopped without any reason

patrick.garceau · November 21, 2019, 2:31pm

I have a problem where the Gateway scripts stopped without any apparent reason and without any logs. They just stalled!
Restarting the Gateway or the service solved the problem.

I need to know why the software is doing this and what I can do to either monitor it or make sure it does not occur again. So far, two of our sites have done this.

Thanks for any help.

Kevin.Herron · November 21, 2019, 3:12pm

Take a thread dump next time it happens. What’s likely is that some script you’re executing off the timer has blocked, and we may be able to see what it’s blocked on in the thread dumps.

patrick.garceau · November 21, 2019, 3:29pm

Is there a way to do this from Ignition, or do you need to do it from Windows?
For example, the timers stopped at 2 am and my investigation was at 9 am. I do sleep!
So how do I get the thread dump even if this is 7 hours later?

Kevin.Herron · November 21, 2019, 3:31pm

There’s a link to download a thread dump from the thread viewer page in the gateway. It doesn’t matter if you do it 7 hours later, as long as you haven’t restarted the gateway yet. If the threads were stuck/blocked for some reason 7 hours ago they will still be blocked when you go to retrieve the thread dump.

Kevin.Herron · November 21, 2019, 3:33pm

A lo-fi way of troubleshooting and monitoring this would be to put print statements at the beginning and end of the scripts running on the timer.

It will be useful to see if they always happen in pairs, or if you get one “start” printout and never see a “finished”.
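A minimal sketch of the start/finish markers, written as plain Python 3 so it runs outside Ignition. `run_with_markers` and `check_missing_minutes` are hypothetical names, not Ignition functions; in a gateway timer script the `log` callable could simply be `print`.

```python
import time


def run_with_markers(name, task, log=print):
    """Run task() with a START line before it and a FINISHED line after it.

    A START with no FINISHED or FAILED line after it means task() never returned.
    """
    started = time.time()
    log("%s START" % name)
    try:
        result = task()
    except Exception as exc:
        # An exception is not a stall; log it so the two cases can be told apart.
        log("%s FAILED: %r" % (name, exc))
        raise
    # Only reached if task() returns; a blocked call never gets here.
    log("%s FINISHED in %.3fs" % (name, time.time() - started))
    return result


def check_missing_minutes():
    # Hypothetical stand-in for the body of a timer script.
    return 42


if __name__ == "__main__":
    run_with_markers("Axium/MissingMinutes", check_missing_minutes)
```

If the log ever ends with a START line for one script and nothing after it, that script is the one that blocked.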

patrick.garceau · November 25, 2019, 1:09pm

I have a third site that also stopped, at midnight sharp, after months or a year of data collection.
This time I took a thread dump.
ThreadDump.txt (164.7 KB)

Kevin.Herron · November 25, 2019, 1:24pm

Is the “stuck” one called “Axium”?

This thread seems to be waiting inside the Redshift JDBC driver for a query to execute.

Daemon Thread [gateway-script-shared-timer-[Axium]] id=64, (TIMED_WAITING)
	owns monitor: com.amazon.redshift.core.jdbc42.PGJDBC42Statement@6f5e2703
	waiting for: java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@5ca53be1
	sun.misc.Unsafe.park(Native Method)
	java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
	java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
	java.util.concurrent.ArrayBlockingQueue.poll(Unknown Source)
	com.amazon.jdbc.communications.InboundMessagesPipeline.validateCurrentContainer(Unknown Source)
	com.amazon.jdbc.communications.InboundMessagesPipeline.getNextMessageOfClass(Unknown Source)
	com.amazon.redshift.client.PGMessagingContext.doMoveToNextClass(Unknown Source)
	com.amazon.redshift.client.PGMessagingContext.getReadyForQuery(Unknown Source)
	com.amazon.redshift.client.PGMessagingContext.closeOperation(Unknown Source)
	com.amazon.redshift.dataengine.PGAbstractQueryExecutor.close(Unknown Source)
	com.amazon.jdbc.common.SStatement.replaceQueryExecutor(Unknown Source)
	com.amazon.jdbc.common.SStatement.executeNoParams(Unknown Source)
	com.amazon.jdbc.common.SStatement.executeNoParams(Unknown Source)
	com.amazon.jdbc.common.SStatement.executeQuery(Unknown Source)
	org.apache.commons.dbcp.DelegatingStatement.executeQuery(DelegatingStatement.java:208)
	org.apache.commons.dbcp.PoolableConnectionFactory.validateConnection(PoolableConnectionFactory.java:332)
	org.apache.commons.dbcp.PoolableConnectionFactory.validateObject(PoolableConnectionFactory.java:312)
	org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:991)
	org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:96)
	org.apache.commons.dbcp.BasicDataSource.getConnection(BasicDataSource.java:880)
	com.inductiveautomation.ignition.gateway.datasource.DatasourceImpl.getConnectionInternal(DatasourceImpl.java:242)
	com.inductiveautomation.ignition.gateway.datasource.DatasourceManagerImpl.getConnectionImpl(DatasourceManagerImpl.java:159)
	com.inductiveautomation.ignition.gateway.datasource.DatasourceImpl.getConnection(DatasourceImpl.java:235)
	com.inductiveautomation.ignition.gateway.datasource.DatasourceManagerImpl.getConnection(DatasourceManagerImpl.java:142)
	com.inductiveautomation.ignition.gateway.script.GatewayDBUtilities.getConnection(GatewayDBUtilities.java:88)
	com.inductiveautomation.ignition.gateway.script.GatewayDBUtilities.getConnection(GatewayDBUtilities.java:73)
	com.inductiveautomation.ignition.gateway.script.GatewayDBUtilities._runQuery(GatewayDBUtilities.java:157)
	com.inductiveautomation.ignition.common.script.builtin.AbstractDBUtilities.runQuery(AbstractDBUtilities.java:329)
	sun.reflect.GeneratedMethodAccessor51.invoke(Unknown Source)
	sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	java.lang.reflect.Method.invoke(Unknown Source)
	org.python.core.PyReflectedFunction.__call__(PyReflectedFunction.java:186)
	com.inductiveautomation.ignition.common.script.ScriptManager$ReflectedInstanceFunction.__call__(ScriptManager.java:431)
	org.python.core.PyObject.__call__(PyObject.java:404)
	org.python.core.PyObject.__call__(PyObject.java:408)
	org.python.pycode._pyx3.GetMissingHours$1(<module:project.Queries>:106)
	org.python.pycode._pyx3.call_function(<module:project.Queries>)
	org.python.core.PyTableCode.call(PyTableCode.java:165)
	org.python.core.PyBaseCode.call(PyBaseCode.java:120)
	org.python.core.PyFunction.__call__(PyFunction.java:307)
	org.python.pycode._pyx2.f$0(<TimerScript:Axium/MissingMinutes @120,000ms >:1)
	org.python.pycode._pyx2.call_function(<TimerScript:Axium/MissingMinutes @120,000ms >)
	org.python.core.PyTableCode.call(PyTableCode.java:165)
	org.python.core.PyCode.call(PyCode.java:18)
	org.python.core.Py.runCode(Py.java:1275)
	com.inductiveautomation.ignition.common.script.ScriptManager.runCode(ScriptManager.java:636)
	com.inductiveautomation.ignition.common.script.ScriptManager.runCode(ScriptManager.java:603)
	com.inductiveautomation.ignition.common.script.TimerScriptTask.run(TimerScriptTask.java:88)
	java.util.TimerThread.mainLoop(Unknown Source)
	java.util.TimerThread.run(Unknown Source)
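
For a large dump, a small script can pull out the shared timer threads and the script frames in their stacks. This is only a sketch in plain Python: the header pattern is inferred from the single thread header shown above, so other dumps may need a different pattern, and `timer_threads` is a hypothetical helper name.

```python
import re

# Header format assumed from the dump above, e.g.
# "Daemon Thread [gateway-script-shared-timer-[Axium]] id=64, (TIMED_WAITING)"
HEADER = re.compile(
    r"^(?:Daemon )?Thread \[(?P<name>.+)\] id=(?P<id>\d+), \((?P<state>\w+)\)$"
)
# Jython frames that carry a script name and a line number, e.g.
# "(<module:project.Queries>:106)" or "(<TimerScript:Axium/MissingMinutes @120,000ms >:1)"
SCRIPT_FRAME = re.compile(
    r"\(<(?P<kind>TimerScript|module):(?P<what>[^>]+?)\s*>:(?P<line>\d+)\)"
)

# Excerpt of the dump above, shortened to the lines this sketch looks at.
SAMPLE = """\
Daemon Thread [gateway-script-shared-timer-[Axium]] id=64, (TIMED_WAITING)
    sun.misc.Unsafe.park(Native Method)
    com.amazon.jdbc.common.SStatement.executeQuery(Unknown Source)
    org.python.pycode._pyx3.GetMissingHours$1(<module:project.Queries>:106)
    org.python.pycode._pyx3.call_function(<module:project.Queries>)
    org.python.pycode._pyx2.f$0(<TimerScript:Axium/MissingMinutes @120,000ms >:1)
    java.util.TimerThread.run(Unknown Source)
"""


def timer_threads(dump_text, prefix="gateway-script-shared-timer"):
    """Return one dict per thread whose name starts with prefix."""
    threads = []
    current = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            # A new thread starts; track it only if it is a shared timer thread.
            current = None
            if header.group("name").startswith(prefix):
                current = {
                    "name": header.group("name"),
                    "state": header.group("state"),
                    "scripts": [],
                }
                threads.append(current)
            continue
        if current is not None:
            frame = SCRIPT_FRAME.search(line)
            if frame:
                current["scripts"].append(
                    (frame.group("kind"), frame.group("what"), int(frame.group("line")))
                )
    return threads


if __name__ == "__main__":
    for thread in timer_threads(SAMPLE):
        print(thread["name"], thread["state"], thread["scripts"])
```

On the excerpt, this reports the `Axium` shared timer thread in `TIMED_WAITING`, with line 106 of `project.Queries` called from the `Axium/MissingMinutes` timer script.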

patrick.garceau · November 25, 2019, 1:54pm

Axium is the name of the project.
There are two gateway timer scripts.
The first one generates a file with all the tags in it and copies it to the hard disk.
The second one verifies in Redshift that the data is there (from the first timer).

Are you saying that the second timer stopped the first timer, which has nothing to do with it?
The reason for having two timers is to keep everything separate.
Why would it have stopped at midnight? The same happened at two other sites. We are now seeing a failure rate of about 10% across all installed sites.
Also, if the second script, the one that queries the database, holds a lock or is stalling, shouldn’t there be a timeout after which it releases everything and returns a failure for the request?
I still don’t get how a second timer could stop every timer on the Gateway; it just doesn’t make any sense.

Kevin.Herron · November 25, 2019, 1:59pm

There are no timeouts as far as script execution is concerned, but you can check whatever documentation is available for the JDBC driver to see if there's a connection property that lets you set a timeout. With some JDBC drivers the default timeout is 0, which means no timeout (that is, wait indefinitely).
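
To illustrate what a script-side time limit could look like, here is a sketch in plain Python. It is not the driver-level timeout mentioned above and not an Ignition feature: it only lets the caller stop waiting and report a failure, while the blocked call (and whatever connection it holds) stays stuck in the background. `call_with_timeout` and `CallTimedOut` are hypothetical names.

```python
import threading


class CallTimedOut(Exception):
    """Raised when the wrapped call has not returned within the time limit."""


def call_with_timeout(func, timeout_s):
    """Run func() on a daemon thread and wait at most timeout_s seconds for it."""
    outcome = {}

    def worker():
        try:
            outcome["value"] = func()
        except Exception as exc:
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.daemon = True
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        # The worker is still blocked; we stop waiting but cannot cancel it.
        raise CallTimedOut("no result after %.1fs" % timeout_s)
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


if __name__ == "__main__":
    never_set = threading.Event()
    try:
        # never_set.wait stands in for a query that does not return.
        call_with_timeout(never_set.wait, 0.2)
    except CallTimedOut as exc:
        print("gave up:", exc)
    finally:
        never_set.set()  # release the stand-in so the worker thread can end
```

A timeout set in the driver is still preferable where one exists, because it ends the blocked call itself rather than just abandoning it.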

You're using a shared timer (you can see in the thread name), which means yes, this timer script blocks the other ones from executing. Mark each timer script as dedicated to prevent this.
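
The effect can be reproduced outside Ignition with a small analogy in plain Python 3: one worker thread plays the role of the shared timer, and one worker per task plays the role of dedicated timers. This is only a model of the behaviour described above, not Ignition's implementation, and `run_two_tasks` is a hypothetical name.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_two_tasks(worker_count, wait_s=0.2):
    """Submit a blocked task and a quick task; return True if the quick one ran."""
    release = threading.Event()
    quick_ran = threading.Event()
    pool = ThreadPoolExecutor(max_workers=worker_count)
    try:
        pool.submit(release.wait)      # stands in for the query that never returns
        pool.submit(quick_ran.set)     # stands in for the unrelated timer script
        return quick_ran.wait(wait_s)  # True only if the quick task got to run
    finally:
        release.set()                  # unblock the first task so the pool can exit
        pool.shutdown(wait=True)


if __name__ == "__main__":
    print("one shared worker:  quick task ran =", run_two_tasks(1))
    print("one worker per task: quick task ran =", run_two_tasks(2))
```

With a single worker the quick task stays queued behind the blocked one and reports `False`; with two workers it runs right away and reports `True`.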