I have dataset properties with Tag History bindings, as well as EasyCharts. Occasionally this type of error is thrown and the Gateway is automatically restarted. Anecdotally, we thought this would occur if the start-to-end range was too large, but as seen here, the range is small. Can anyone help me troubleshoot this error?
In this case there were two errors generated by the EasyCharts:
Exception: Error running query:
TagHistory(paths=[histprov:RYDB:/prov:RYTP:/tag:AWERK/automatic mode, histprov:RYDB:/prov:RYTP:/tag:AWERK/circuit breaker fault, histprov:RYDB:/prov:RYTP:/tag:AWERK/collective fault emergency off, prov:RYTP:/tag:awerk/sim/fs1, prov:RYTP:/tag:awerk/sim/fv1, prov:RYTP:/tag:awerk/sim/fs2, prov:RYTP:/tag:awerk/sim/fv2], start=Mon Jun 18 00:00:00 EDT 2018, end=Mon Jun 18 00:00:01 EDT 2018, flags=0)@3000ms
On: 313 AWERK hist.Root Container.cht hist
caused by GatewayException: Read timed out
caused by SocketTimeoutException: Read timed out
Ignition v7.9.9 (b2018081621)
Java: Oracle Corporation 1.8.0_171
Exception: Error running query:
TagHistory(paths=[prov:RYTP:/tag:awerk/sim/fs1, prov:RYTP:/tag:awerk/sim/fv1, prov:RYTP:/tag:awerk/sim/fs2, prov:RYTP:/tag:awerk/sim/fv2], start=Fri Sep 21 15:17:26 EDT 2018, end=Fri Sep 21 15:17:49 EDT 2018, flags=0)@3000ms
On: 310 AWERK.Root Container.cht trend
caused by GatewayException: Read timed out
caused by SocketTimeoutException: Read timed out
Ignition v7.9.9 (b2018081621)
Java: Oracle Corporation 1.8.0_171
ClockDriftDetector shows up in there with 2.5-second deviations. As noted in that message, there are only three real possibilities: 1) the server is bogged down, 2) the clock is changing erratically (sometimes a VM setup problem), or 3) pause-the-world garbage collection is taking too long.
In my experience, it has always been #3, due to the use of the old CMS garbage collector. One case of #2 has shown up here, though, so it isn’t certain. Visit the Performance page of the gateway status section and look at the graphs. If the ClockDriftDetector events correlate with peak memory usage, it is almost certainly the garbage collector.
You will need direct access to the gateway to do much of anything.
That is correct - the CPU Trend spikes to 70% (typical being 8%) and the Memory Trend spikes to 4 GB (typical being 0.5 GB) at the same time the ClockDrift warnings are logged.
Additionally, the current architecture is very poor - the Gateway server is remote, located whole states away from the client site. I have tried to address this, but the client does not want to adjust. (The application is read-only at the moment; otherwise I would not develop with this kind of latency.) The tags consistently blip in and out of a good connection. I was wondering if that could impact the clock drift?
And, ultimately, should I write this off as a product of the poor architecture?
The tags are probably going bad because the pause-the-world garbage collection cuts off communication for too long. See the related timeouts for your drivers. Running OPC drivers on a remote server is disastrous for performance; most PLC protocols are very latency sensitive, since only a single request (or very few) can be issued at a time. But latency alone doesn't produce timeouts.
Look on your gateway’s performance page again: Under the memory chart, you should see “Garbage Collectors: G1 Young Generation, G1 Old Generation”. If you see anything else, you need to change ignition.conf to use G1GC.
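For reference, the switch is made in the JVM arguments of ignition.conf. A minimal sketch, assuming the CMS flag currently occupies slot 1 in your file (the `wrapper.java.additional.N` index and surrounding entries vary per install, so match your own file's numbering):

```
# ignition.conf — replace the old CMS collector flag with G1GC
# (the wrapper.java.additional.N index must match your existing file)
#wrapper.java.additional.1=-XX:+UseConcMarkSweepGC
wrapper.java.additional.1=-XX:+UseG1GC
```

Restart the Gateway service after editing, then confirm the Performance page now lists the G1 collectors.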
Hmm. You really need the wrapper log, not just the web-accessible log, so you can see why the gateway is restarting. You might just be doing something that takes more than 4 GB of memory. A scheduled report, perhaps? Reports are memory hogs.
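If the wrapper log does turn out to show an out-of-memory crash, one lever is the heap ceiling set by the Java Service Wrapper in ignition.conf. A hedged example, assuming the server actually has the RAM to spare (values are illustrative, in MB, not recommendations):

```
# ignition.conf — Java Service Wrapper heap settings (example values)
wrapper.java.initmemory=1024
wrapper.java.maxmemory=4096
```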
I will request the logs - I feel the application is pretty light at this point. No reports, and query runs and refreshes are event driven. Yet, maybe I’ve done something inefficient regardless.
@pturmel I received the wrapper log file. I will post it here, and I would appreciate any further assistance - however, I don't want to ask you to read through the logs! I will open a service ticket with IA.
There were two crashes - one around 12:12 PM and another around 1:26 PM
The latency-heavy network architecture does cause queries to randomly increase in duration by more than a factor of 10. In addition, I had a script that would fire four simultaneous history binding queries. When the query lag coincided with this event, RAM usage would spike above the available 4 GB limit and induce a crash.
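For anyone hitting the same thing, the fix concept is to stop the queries from running concurrently so that several large result sets are never held in memory at once. A minimal plain-Python sketch of that serialization (in Ignition, the stand-in query functions would wrap calls such as `system.tag.queryTagHistory`; nothing below is taken from the actual project script):

```python
# Illustrative sketch: run history queries one at a time instead of
# firing all of them simultaneously. Peak memory is then bounded by
# the largest single result set, not the sum of all of them.

def run_serially(query_fns):
    """Run each zero-argument query function in turn and collect results.

    The next query starts only after the previous one has returned,
    so its result can be consumed (or released) before more memory
    is allocated.
    """
    results = []
    for fn in query_fns:
        results.append(fn())
    return results

# Usage with stand-in queries (in Ignition these lambdas would call
# system.tag.queryTagHistory with the real paths and date range):
queries = [lambda i=i: "result-%d" % i for i in range(4)]
print(run_serially(queries))
```

If the individual queries are still too large, the same idea extends to chunking each query's date range into smaller windows and processing them sequentially.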