Abnormal RAM usage and gateway reboot

Hi everybody,
in my application I have the following problem: the gateway normally uses about 200-300 MB of RAM, but sometimes RAM usage grows up to 8 GB (the maximum set in ignition.conf). After this the gateway restarts itself.

I have checked whether the cause is in a script or a report (it isn't).
In the log there are only clock drift warnings, but in my opinion those are a consequence, not the cause.

Any ideas? Are there loggers whose level I could increase (to DEBUG, TRACE, etc.) that might help find the cause of the problem?

Thank you.

How long does it take for the ram to grow? Is it a matter of minutes or a matter of days?

I have to say I think 8 GB is quite a lot to assign when you typically only need 200-300 MB. Java has a tendency to let RAM usage build up until it really needs to collect the garbage. If there's a lot of garbage to collect, that can put extra stress on the system, especially if parts of the old data have already been paged out to the slow pagefile.

If the memory grows over a long time and then crashes, it might be worth lowering the maximum memory and checking what happens. 1 GB should be plenty if it typically only needs 200-300 MB. You may also want to look into better garbage collectors, like G1GC.
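If you want to try that, both settings live in ignition.conf. A minimal sketch, assuming the usual Java Service Wrapper layout (the index numbers here are examples; keep them sequential with the wrapper.java.additional.N entries already in your file):

```
# ignition.conf -- Java Additional Parameters section (example indices)
wrapper.java.additional.3=-XX:+UseG1GC
wrapper.java.additional.4=-XX:MaxGCPauseMillis=100

# Lower heap cap as suggested above (value is in MB)
wrapper.java.maxmemory=1024
```

The gateway service has to be restarted for changes to ignition.conf to take effect.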

If it builds up in a matter of minutes, then there's probably a real bug somewhere in a script or a configuration. The first thing I would do is symptom hunting: disable parts one at a time on a test server until the problem stops happening.

If that doesn’t help, or if the RAM issues are too infrequent to know when you solved it, attaching a debugger is probably the only way to go, though I don’t know about good ones for Java.

Hi Sandered17,
the RAM usage grows within a few seconds, at most one minute. The frequency is random: sometimes it happens after a couple of days, sometimes after several weeks or a month.


That also means you can’t log continually in trace mode, as it would put extra stress on the system for a long time. Profilers are also generally not designed to run that long.

Your best bet is probably to put a trigger on the memory consumption; see the tag [System]Gateway/Performance/Memory Utilization.

When that value gets too high, start logging thread dumps, details about open clients, etc.
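As a sketch of that trigger logic, in plain Python with the Ignition-specific calls stubbed out (in a real gateway timer script, `read_memory_pct` would read the [System]Gateway/Performance/Memory Utilization tag and `capture_diagnostics` would write the thread dump; the function names and the 90% threshold are my own assumptions):

```python
import time

MEMORY_TRIGGER_PCT = 90.0  # hypothetical threshold; tune for your gateway


def read_memory_pct():
    # Placeholder: in a gateway timer script, read the
    # [System]Gateway/Performance/Memory Utilization tag here instead.
    return 42.0


def capture_diagnostics(pct):
    # Placeholder: dump threads, open-client details, etc. here.
    return "memory at %.1f%% - capturing thread dump" % pct


def check_once(read=read_memory_pct, capture=capture_diagnostics):
    """Run one poll; return the diagnostic message if triggered, else None."""
    pct = read()
    if pct >= MEMORY_TRIGGER_PCT:
        return capture(pct)
    return None
```

Run `check_once` from a short-interval timer; since the spike takes seconds, a poll period of a second or two is probably needed for the trigger to fire in time.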


Perhaps that way you can figure out where the cause of the problem is, and start reproducing it more reliably.

Consider using Oracle's jmap tool to capture a histogram of per-class memory utilization during normal operations (as a baseline), then again on a 95% memory-utilization trigger. Note that this tool does cause the equivalent of a GC pause while gathering the stats, so you should not use it under normal operation other than to get your baseline (or baselines).
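A hedged sketch of the invocation (jmap ships with the JDK; the PID value below is a placeholder, so look up the gateway's real PID first with jps or pgrep):

```shell
GATEWAY_PID="${GATEWAY_PID:-12345}"   # hypothetical PID -- replace with your gateway's
OUT="histo-$(date +%Y%m%d-%H%M%S).txt"
# jmap -histo pauses the JVM (roughly like a GC pause) while it counts objects,
# so run it once for a baseline and then only when the utilization trigger fires.
if command -v jmap >/dev/null 2>&1 && kill -0 "$GATEWAY_PID" 2>/dev/null; then
  jmap -histo "$GATEWAY_PID" > "$OUT"
  echo "histogram written to $OUT"
else
  echo "jmap not found or PID $GATEWAY_PID not running; command would be: jmap -histo $GATEWAY_PID"
fi
```

Comparing the baseline histogram against one taken during a spike should show which classes account for the extra gigabytes.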

We have had the same problem. The memory utilization is stable at around 1 GB for 3+ weeks, and then memory use jumps by 3-4 GB nearly instantaneously and is never released. Sometimes one event causes the "stop the world" pause and the service crashes and restarts, and sometimes it takes two events.

  1. We have changed over to the newer garbage collector, but this has not eliminated the problem.
  2. We have implemented a script to do a thread dump when memory utilization is above 95%, but the event occurs so quickly that the dump is never triggered.
  3. I have looked at the wrapper logs, and the only thing I can find taking place near the memory jumps is one of our users launching many (5-7) projects within 1-2 seconds of each other. Not sure if this is related, but it is all I have found so far.


Have a ticket in for this as well, except that I've watched the memory build up over a period of 12 hours. Changed the garbage collectors; no joy. Went to 7.9.3 and a new Java version and haven't seen it in 3 weeks. Not calling it fixed, just haven't seen it for a few weeks.