I have dataset properties with Tag History bindings, as well as EasyCharts. Occasionally this type of error is thrown and the Gateway is automatically restarted. Anecdotally, we thought this would occur if the start-to-end range was too large, but as seen here, the range is small. Can anyone help me troubleshoot this error?
In this case there were two errors generated by the EasyCharts:
Exception: Error running query:
TagHistory(paths=[histprov:RYDB:/prov:RYTP:/tag:AWERK/automatic mode, histprov:RYDB:/prov:RYTP:/tag:AWERK/circuit breaker fault, histprov:RYDB:/prov:RYTP:/tag:AWERK/collective fault emergency off, prov:RYTP:/tag:awerk/sim/fs1, prov:RYTP:/tag:awerk/sim/fv1, prov:RYTP:/tag:awerk/sim/fs2, prov:RYTP:/tag:awerk/sim/fv2], start=Mon Jun 18 00:00:00 EDT 2018, end=Mon Jun 18 00:00:01 EDT 2018, flags=0)@3000ms
On: 313 AWERK hist.Root Container.cht hist
caused by GatewayException: Read timed out
caused by SocketTimeoutException: Read timed out
Ignition v7.9.9 (b2018081621)
Java: Oracle Corporation 1.8.0_171
Exception: Error running query:
TagHistory(paths=[prov:RYTP:/tag:awerk/sim/fs1, prov:RYTP:/tag:awerk/sim/fv1, prov:RYTP:/tag:awerk/sim/fs2, prov:RYTP:/tag:awerk/sim/fv2], start=Fri Sep 21 15:17:26 EDT 2018, end=Fri Sep 21 15:17:49 EDT 2018, flags=0)@3000ms
On: 310 AWERK.Root Container.cht trend
caused by GatewayException: Read timed out
caused by SocketTimeoutException: Read timed out
Ignition v7.9.9 (b2018081621)
Java: Oracle Corporation 1.8.0_171
ClockDriftDetector shows up in there with 2.5-second deviations. As noted in that message, there are only three real possibilities: 1) the server is bogged down, 2) the clock is changing erratically (sometimes a VM setup problem), or 3) pause-the-world garbage collection is taking too long.
In my experience, it has always been #3, due to the use of the old CMS garbage collector. One case of #2 has shown up here, though, so it isn’t certain. Visit the Performance page of the gateway status section and look at the graphs. If the ClockDriftDetector events correlate with peak memory usage, it is almost certainly the garbage collector.
You will need direct access to the gateway to do much of anything.
That is correct - the CPU Trend spikes to 70% (typical being 8%) and the Memory Trend spikes to 4 GB (typical being 0.5 GB) at the same time the ClockDrift warnings are logged.
Additionally, the current architecture is very poor - the Gateway server is remote, located whole states away from the client site. I have tried to address this, but the client does not want to adjust. (The application is read-only at the moment; otherwise I would not develop with this kind of latency.) The tags consistently blip in and out of a good connection. I was wondering if that could impact the clock drift?
And, ultimately, should I write this off as a product of the poor architecture?
The tags are probably going bad because the pause-the-world garbage collection cuts off communication for too long. See the related timeouts for your drivers. Running OPC drivers on a remote server is disastrous for performance; most PLC protocols are very latency sensitive, since only a single request (or very few) can be issued at a time. But latency alone doesn't produce timeouts.
Look on your gateway’s performance page again: Under the memory chart, you should see “Garbage Collectors: G1 Young Generation, G1 Old Generation”. If you see anything else, you need to change ignition.conf to use G1GC.
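For reference, the switch is made in the JVM arguments of ignition.conf. A minimal sketch, assuming the CMS flag currently occupies slot 1 in your file (the `wrapper.java.additional.N` index and surrounding entries vary per install, so match your own file's numbering):

```
# ignition.conf — replace the old CMS collector flag with G1GC
# (the wrapper.java.additional.N index must match your existing file)
#wrapper.java.additional.1=-XX:+UseConcMarkSweepGC
wrapper.java.additional.1=-XX:+UseG1GC
```

Restart the Gateway service after editing, then confirm the Performance page now lists the G1 collectors.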
Hmm. You really need the wrapper log, not just the web-accessible log, so you can see why the gateway is restarting. You might just be doing something that takes more than 4 GB of memory. A scheduled report, perhaps? Reports are memory hogs.
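If the wrapper log does turn out to show an out-of-memory crash, one lever is the heap ceiling set by the Java Service Wrapper in ignition.conf. A hedged example, assuming the server actually has the RAM to spare (values are illustrative, in MB, not recommendations):

```
# ignition.conf — Java Service Wrapper heap settings (example values)
wrapper.java.initmemory=1024
wrapper.java.maxmemory=4096
```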
I will request the logs - I feel the application is pretty light at this point. No reports, and query runs and refreshes are event driven. Yet, maybe I’ve done something inefficient regardless.
@pturmel I received the wrapper log file. I will post it here, and I would appreciate any further assistance - however, I don't want to ask you to read through the logs! I will open a service ticket with IA.
There were two crashes - one around 12:12 PM and another around 1:26 PM
The latency-heavy network architecture does cause queries to randomly increase in duration by more than a factor of 10. In addition, I had a script that would fire four simultaneous history binding queries. When the query lag coincided with this event, RAM usage would spike above the available 4 GB limit and induce a crash.
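For anyone hitting the same thing, the fix concept is to stop the queries from running concurrently so that several large result sets are never held in memory at once. A minimal plain-Python sketch of that serialization (in Ignition, the stand-in query functions would wrap calls such as `system.tag.queryTagHistory`; nothing below is taken from the actual project script):

```python
# Illustrative sketch: run history queries one at a time instead of
# firing all of them simultaneously. Peak memory is then bounded by
# the largest single result set, not the sum of all of them.

def run_serially(query_fns):
    """Run each zero-argument query function in turn and collect results.

    The next query starts only after the previous one has returned,
    so its result can be consumed (or released) before more memory
    is allocated.
    """
    results = []
    for fn in query_fns:
        results.append(fn())
    return results

# Usage with stand-in queries (in Ignition these lambdas would call
# system.tag.queryTagHistory with the real paths and date range):
queries = [lambda i=i: "result-%d" % i for i in range(4)]
print(run_serially(queries))
```

If the individual queries are still too large, the same idea extends to chunking each query's date range into smaller windows and processing them sequentially.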