Memory Leakage / Garbage Collection Issue

Hi all.

So our gateway was eating up huge amounts of memory (9 GB) in a repeating, sawtooth pattern. I'd assumed this was a garbage collection issue, and following Phil Turmel's advice here:

did seemingly fix the problem. Memory usage was still sawtoothy, but the peaks never got higher than about 1 GB before falling back down again.
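For context, the change amounted to switching the gateway's collector flags over to G1GC in data/ignition.conf. A minimal sketch of what that looks like; the exact index numbers and any neighboring flags will vary by install:

    # in data/ignition.conf -- index numbers are illustrative, match them to your file
    wrapper.java.additional.1=-XX:+UseG1GC
    wrapper.java.additional.2=-XX:MaxGCPauseMillis=100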

However, since making that change, we saw memory usage go through a 'phase transition' at one point: all of a sudden there was a much higher rate of memory use, and a much higher level of memory had to pile up before the garbage collection algorithm would clear things out. Fortunately, the spikes were lower than before (4-5 GB before collection), but that's still concerning, especially since it had only been a month or so since the change in settings.

During a maintenance period, we tried fixing this. We closed out all clients, all designers, everything, and found that rebooting the gateway computer fixed things. Memory usage is back down to about where it was. The thing is, we don't know what caused this, and we cannot regularly cycle this computer, as we rely on it during facility operations.

Any guesses what could cause the memory management to change suddenly like that, and be fixed by a reboot, while using G1GC?

My guess would be a change in memory pressure provoked it (maybe some big history or alarm queries going off?) and then the heuristics of the GC in use acted accordingly.

I don't think anything is wrong here, though, since you aren't actually experiencing a memory leak.

If it happens again don't restart the gateway. Let it go for a while and see what happens.

If the gateway eventually restarts itself because the service wrapper says the JVM isn't responding, and you start seeing warnings from the Clock Drift Detector, then maybe there is a leak.

It's not a good idea to give Ignition access to that much memory. Ignition should never start to use swap memory, as that makes garbage collection unbearably slow. It's best to monitor normal usage, and add a margin of e.g. 50% to determine your best maximum memory setting. And make sure the OS has enough memory to fulfill that request.

Letting it use 900% of its normal usage will make any odd memory spikes more severe.
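As a rough worked example (the numbers are illustrative, not a recommendation for this system): if the gateway's normal sawtooth peaks around 1 GB, that rule of thumb gives a ceiling of roughly 1.5 GB, set in data/ignition.conf (values are in megabytes):

    wrapper.java.initmemory=1024
    wrapper.java.maxmemory=1536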

Apart from that, try to find a pattern in when it happens. I haven't seen this mentioned here with G1GC (we also use G1GC for all our projects).

Perhaps it's caused by some special script you made. It's possible to have a memory leak in Python too: e.g. when you spawn threads but never close them, the objects remain reachable from within the thread and so are never garbage collected, even though the threads themselves may be unreachable from your Python code.
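A contrived sketch of that pattern, assuming an Ignition gateway/Jython scripting context where the built-in system.util.invokeAsynchronous is available (all the names here are just for illustration):

    import time

    def startWorker():
        bigData = range(1000000)   # large structure captured by the closure below

        def loop():
            # Never exits and never checks a shutdown flag, so the thread keeps
            # running and keeps 'bigData' reachable (un-collectable) forever.
            while True:
                _ = len(bigData)   # pretend to use the data
                time.sleep(60)

        # Fire-and-forget: nothing ever stops or joins this thread.
        system.util.invokeAsynchronous(loop)

    # Calling startWorker() repeatedly (e.g. from a timer or tag change script)
    # piles up threads and their captured objects, which looks like a slow leak.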

I would use jmap to obtain memory allocation histograms during normal and excursion conditions. The downside is that jmap will almost certainly provoke an execution pause while it grabs its stats.
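Roughly, assuming a JDK matching the gateway's Java version is on the machine and you run it as the same account the gateway service runs under: jps -l to find the gateway JVM's PID, then something like the following (the :live option forces a full GC before counting):

    jmap -histo:live <pid> > histo.txt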

Thanks, Phil.

Looks like jmap is going to be a project on Windows; at least it was for this guy. Including it here for posterity. FWIW, what he's describing is exactly what we're seeing, though no resolution is posted.

I vaguely recall skipping that topic. My condolences.

{ I'll tack this on to my ever-growing list of reasons to never ship Windows in any system of mine. }

I'll say again... I don't think there are any problems here.

You gave the JVM a 9 GB max and it eventually used it, as expected. Your graph shows ~10 collections in the span of 4 hours.

When you say 'let it go for a while', did you notice that the high-mem-use regime started on 8-14-2019 and only ended when we rebooted on 8-24-2019? Various clients, alarms, and designer sessions were started and ended during this period. How long is 'a while' if 10 days doesn't cut it?

So, is it usual for a Java application, regardless of garbage collection algorithm, regardless of how that application is set up, to use all memory made available to it?

If that's the case, then it should make no difference WHAT the ceiling is, since it will always be reached. Are you suggesting we could get away with assigning 700 MB (since, after a reboot, that seems to be the minimum level of memory needed to run the gateway and clients, plus 50%) and leaving it there?

Because that doesn't track with our experience. Before switching to G1GC, when memory usage would run up against the 4 GB limit we'd originally given it (we have 16 GB on the server), there would be noticeable slowdowns, components on clients would show errors, and we'd have ClockDrift errors until the garbage was collected.

Yes!

For a small application, maybe. For a larger one this will result in very frequent collections, assuming it can keep up, and probably won't perform well.

Yep... the previous GC algorithm "paused the world" to clean up, and it was noticeable in the application when that happened. G1GC is much better about this. It will still pause the world, but less often and for a shorter duration.

That's the key. Different algorithms use different heuristics for triggering GC of various depths. With G1GC, the higher the % of max used, the more likely a deep-clearance GC will run (with a likely brief pause-the-world). The older CMS algorithm tended to avoid deep clearance until forced by lack of free memory, and it would produce horrendous pauses. G1GC gets a lot more done in its shallow clearances.

Java memory usage is naturally a sawtooth under all algorithms. With G1GC, I recommend, after monitoring, setting max memory so that the top of the sawtooth is between 80% and 90% of max, and the trough of the sawtooth is less than 50% of max.
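If you want to see exactly where those peaks and troughs land, GC logging helps. On a 7.9-era gateway running Java 8, that would typically be a few more entries in data/ignition.conf, something like (index numbers illustrative):

    wrapper.java.additional.3=-XX:+PrintGCDetails
    wrapper.java.additional.4=-XX:+PrintGCDateStamps
    wrapper.java.additional.5=-Xloggc:logs/gc.log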

So, again, after switching to G1GC, we saw the peaks jump from 1 GB up to 4 GB. It wasn't a slow drift; it was a very clear and discrete change between regimes.

We started from a very 'hairy' pattern of garbage collection happening frequently, about once a minute, overlaid on a longer 20-hour cycle that recycled at about 1 GB. Then, all of a sudden, we moved to a slower pattern (as Kevin noticed, 10 collections in 4 hours) that allowed garbage to pile up to several gigs, and was not superimposed on any slower fluctuation in memory use. 4 GB is not as much as we were using before the switch, but there's of course no reason to assume 4 GB is some natural maximum, or that 0.4 hr is some natural frequency. If it runs slower next time, we'll pile up more garbage, and in theory, no ceiling is high enough.

It's that change in behavior, which came on suddenly and persisted until a reboot, that is worrying. I'd very much like to peg us at some 120% of the sawtooth peak, but if I had done so immediately after switching to G1GC, I would have pegged us at 1.2 GB, and we would have been absolutely murdered with pause-the-worlds when the algorithm went through that behavior change.

I don't think so. During the period where memory usage was peaking around 4 GB, it was deep-cleaning without apparent problem. A lower max memory would have made the garbage collector deep-clean more frequently, with less to do on each pass, and correspondingly shorter pause-the-world times. Based on your chart, you should give Ignition about 1.5 GB.

Thanks,

OK, so if I understand correctly, the program using all of the memory assigned to it is only a problem if that memory is a significant fraction of total system resources. Bumping up against a low ceiling should be fine.

One new fact we've discovered is that the rapid onset of higher memory use and less frequent collection is related to losing the connection to our main OPC server. We see all tags served by it marked with a red 'x'. However, on the gateway the server is not listed as 'faulted' but as permanently 'connecting'. Disabling and re-enabling the connection instantly re-establishes it, but in doing so, CPU and memory usage spike. CPU usage hits about 60% during the period when the 4 GB or so of garbage is being cleared, but hovers around 20-30% generally.

If, after restarting the connection, we restart the gateway, CPU usage sits around 4% consistently.

Is any of this suggestive? I took a thread dump from the Gateway Control Utility before cycling the gateway, but I'm not sure I'm competent to parse the output.
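(For posterity: the same kind of dump can apparently also be taken from the command line with the JDK's jstack, e.g. jstack -l <pid> > threads.txt, subject to the same run-as-the-service-account caveat as jmap on Windows.)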

This seems like a different problem that I'd rather troubleshoot in another thread, if not via support.

What version of Ignition are you using, and what is the "main" OPC UA server you are connected to?

It's not weird that you see a spike of CPU and memory churn when it reconnects because there's a bunch of browsing and subscription re-establishment happening, so you've momentarily left the "steady state".

What is weird is that it doesn't reconnect on its own and you have to disable/enable it.

It's not weird that you see a spike of CPU and memory churn when it reconnects because there's a bunch of browsing and subscription re-establishment happening, so you've momentarily left the "steady state".

Disabling and re-enabling the connection instantly re-establishes it, but in doing so, CPU and memory usage spike. CPU usage hits about 60% during the period when the 4 GB or so of garbage is being cleared, but hovers around 20-30% generally.

When you say 'let it go for a while', did you notice that the high-mem-use regime started on 8-14-2019 and only ended when we rebooted on 8-24-2019? Various clients, alarms, and designer sessions were started and ended during this period. How long is 'a while' if 10 days doesn't cut it?

The problem persists arbitrarily long. We have not observed it go away on its own without a reboot. It's not 'normal' churn.

What version are you on? This sounds like a bug in one of the later 7.9.x versions.

I've seen similar memory usage trends on at least two installations (one running 7.9.9, one at 7.9.10) that run at some baseline with the normal sawtooth trend, and then that baseline jumps up to 300% or 400% of what it was and stays there until the gateway server is restarted. Both are running Windows Server 2016 and are using the G1GC garbage collector.

Yes, we're on 7.9.10.

We'd tried updating to version 8, but found several of our projects simply broke. Given that even the fonts weren't displaying correctly, we decided to wait out a few sub-versions and let the bugs get fixed before trying to sort out which issues were things we needed to address ourselves.

Also a good guess. We are indeed using G1GC and a Windows-based system. I'll have to find out whether we are using Windows Server 2016. Do you mean on the gateway computer, or on the OPC server that is correlated with starting this behavior? I know that particular server is specifically serving tags from a GE PLC network we have on site. I'm not sure if it's running any particular server software besides what it comes packaged with.