Ignition Gateway restarts itself (v8.1.18, 16GB memory allocation)

Yesterday we concurrently increased the gateway memory allocation (8GB to 16GB) and updated the Ignition software version (v8.1.17 to v8.1.18).
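(For context, the allocation change was made in the gateway's ignition.conf. A minimal sketch is below; values are in MB, and the init value shown is a placeholder rather than our exact setting.)

```
# data/ignition.conf -- Java heap settings (values in MB)
wrapper.java.initmemory=2048
wrapper.java.maxmemory=16384
```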
Today we have had a number of instances where the gateway appears to have restarted itself, each preceded by ClockDriftDetector and Timeout warnings.
My first thought was to look at garbage collection, due to things I had read in the past, but I understand that Ignition now uses the recommended G1GC garbage collector by default.
We are investigating further, but I was hoping to solicit input from others who might intuitively understand potential causes and resolutions to this problem.

Partial log excerpts from what appears to be the beginning of problems leading up to the unguided gateway restart…

Cross-linking to similar threads (when I find them) for reference:

Sounds like a memory leak leading to exhaustion. GC crises lead to failed memory allocations, which lead to errors all over the place, so oddball errors are not especially meaningful. I recommend getting support to help you do a deep dive with jhat and/or Flight Recorder.
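If it helps, here is a minimal sketch of wrapper entries that capture a heap dump on OutOfMemoryError and run a continuous Flight Recorder session. The `.N` indices and paths are placeholders; they must not collide with the numbered entries already in your ignition.conf.

```
# data/ignition.conf -- Java Additional Parameters
# Indices and paths are placeholders; keep each .N unique within the file.
wrapper.java.additional.10=-XX:+HeapDumpOnOutOfMemoryError
wrapper.java.additional.11=-XX:HeapDumpPath=/var/log/ignition
wrapper.java.additional.12=-XX:StartFlightRecording=maxsize=512m,dumponexit=true,filename=/var/log/ignition/gateway.jfr
```

With `dumponexit=true`, the recording is written out even when the JVM goes down, which is useful when the gateway is restarting itself.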

OK. We have reduced the memory allocation from 16GB back to 8GB and will observe whether we get any further ClockDriftDetector warnings or similar issues. So far, things seem stable.

So the system memory usage appeared stable until this morning, when it suddenly jumped to 100% utilization and triggered a restart of the gateway.

Log filtered for “drift”

I’ll echo the recommendation to contact support.


Do you have a complex reporting or data-mining script that kicks off shortly after 7am?

Another event occurred about 3 hours after the one earlier today (within a few minutes; not exact).

Given the stability all evening and the recurrence in the morning when user interaction increases, I suspect (but am not certain) that some user-initiated action may be triggering the issue. We are looking into scheduled and user-triggered script executions.

Submitted a support request (ID #53024).


I am interested in hearing what the cause and resolution turn out to be for you.

The other thing that is likely to happen in the morning, I think, is more sessions being opened.

We now have a suspected cause. It looks like it may be related to a view used for showing audit profile data and a poorly formed SQL query against the AUDIT_EVENTS table, something that was started and then deprioritized/forgotten before it was finished. We found a `SELECT * FROM AUDIT_EVENTS` running against roughly 92+ million audit records in that table. Yes, the query should have been more specific and limited in a variety of ways. We have so many audit records, despite pruning at 90 days, because we perform many scripted tag writes each second and didn't previously realize those would be added to the audit log.
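For contrast, a bounded version of that query might look like the sketch below. This is illustrative only (SQL Server syntax shown); the column names follow the default audit schema as I recall it and should be verified against your actual table.

```sql
-- Sketch: pull only a recent window and the needed columns,
-- instead of SELECT * over ~92 million rows.
SELECT TOP 1000
       EVENT_TIMESTAMP, ACTOR, ACTION, ACTION_TARGET, ACTION_VALUE
FROM AUDIT_EVENTS
WHERE EVENT_TIMESTAMP >= DATEADD(day, -1, GETDATE())
ORDER BY EVENT_TIMESTAMP DESC;
```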

I wasn't able to track that down via log messages, etc.; we just happened to notice an association between navigation to a certain page and the gateway restarts.

I couldn't spend much time verifying/reproducing the behavior, since the current priority is uptime, but even just opening the view in the Designer seemed to trigger the memory spike and brought the gateway down in seconds. Oddly enough, that was with gateway communications disabled. There are many hands in the pot on this one (developers and active users), so it's hard to be certain of the exact circumstances; perhaps someone else navigated to the view before I disabled it.



Quick update on this issue…
We disabled the ability for users to initiate queries of the audit log from the front end (Perspective session), and that appears to have resolved the issue with recurring gateway restarts.
However, we found that using the view-log feature from the Config > Security > Auditing gateway menu also causes the same issue (memory and CPU spike to 100%, and the gateway restarts itself).

We have since disabled auditing in the project properties (for now) and set retention to 1 day in the configured audit profile to purge the existing audit records (over the next day).

@PGriffith Ignition developers may want to be aware that clicking a built-in feature in the gateway interface can crash the gateway. Granted, we have 642 million audit records due to having many (and frequent) scripted tag writes, which is likely abnormal for most implementations.

The audit log page, though it's got shiny new chrome on it, is actually one of the oldest pages still around on the gateway. In 8.2 we're planning to fundamentally revamp the web interface, which should hopefully fix this problem.