So our gateway was eating up huge amounts of memory (9 GB) in a repeating, sawtooth pattern. I'd assumed this was a garbage collection issue, and following Phil Turmel's advice here:
did seemingly fix the problem. Memory usage was still saw-toothy, but the peaks never got higher than about 1 GB before falling back down.
However, since making that change, we saw at one point memory usage go through a 'phase transition': all of a sudden memory was being consumed at a much higher rate, and a much higher level had to accumulate before the garbage collection algorithm would clear things up. Fortunately, the spikes were lower than before (4-5 GB before collection), but that's still concerning, especially since it had only been a month or so since the change in settings.
During a maintenance period, we tried fixing this. We closed out all clients, all designers, everything, and found that rebooting the gateway computer fixed things. Memory usage is back down to about where it was. The thing is, we don't know what caused this, and we can't regularly cycle this computer, as we rely on it during facility operations.
Any guesses what could cause the memory management to change suddenly like that, and be fixed by a reboot, while using G1GC?
My guess would be that a change in memory pressure provoked it (maybe some big history or alarm queries going off?) and then the heuristics of the GC in use acted accordingly.
I don't think anything is wrong here, though, since you aren't actually experiencing a memory leak.
If it happens again, don't restart the gateway. Let it go for a while and see what happens.
If the gateway eventually restarts itself because the service wrapper says the JVM isn't responding, and you start seeing warnings from the Clock Drift Detector, then maybe there is a leak.
It's not a good idea to give Ignition access to that much memory. Ignition should never start using swap, as that makes garbage collection unbearably slow. It's best to monitor normal usage and add a margin of, say, 50% to determine your best maximum memory setting. And make sure the OS has enough memory to fulfill that request.
Letting it use 900% of its normal use will make any odd memory spikes more severe.
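For reference, the heap ceiling lives in the service wrapper config (data/ignition.conf on a typical install). This is only a sketch of the relevant lines; the sizes here are illustrative, not a recommendation:

    # data/ignition.conf (Java Service Wrapper settings; values are illustrative)
    # Initial Java heap size, in MB
    wrapper.java.initmemory=1024
    # Maximum Java heap size, in MB -- size this to observed peak usage plus a margin
    wrapper.java.maxmemory=2048
    # G1GC flags; replace N with the next unused index in your file
    wrapper.java.additional.N=-XX:+UseG1GC
    wrapper.java.additional.N=-XX:MaxGCPauseMillis=100

The gateway service has to be restarted for changes here to take effect.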
Apart from that, try to find a pattern in when it happens. I haven't seen this mentioned here with G1GC (we also use G1GC for all our projects).
Perhaps it's caused by some special script you made. It's possible to have a memory leak in Python too: for example, if you spawn threads but never stop them, objects that are still reachable from within those threads are never garbage collected, even though the threads themselves may be unreachable from your Python code.
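A contrived sketch of that kind of leak (assuming Ignition's scripting environment, where system.util.invokeAsynchronous is available; everything else here is made up for illustration):

    # A long-running background thread keeps a large object reachable,
    # so the garbage collector can never free it.
    import time

    def startCache():
        bigCache = [str(i) * 256 for i in range(100000)]  # a large pile of distinct strings

        def worker():
            # Loops forever and closes over bigCache, so bigCache stays reachable
            # from this thread even after the calling script has returned.
            while True:
                time.sleep(60)
                checksum = sum(len(s) for s in bigCache)  # keeps using the list

        # Spawns a background thread that never exits; every call to startCache()
        # leaks another copy of bigCache.
        system.util.invokeAsynchronous(worker)

Nothing in the project references bigCache after startCache() returns, but the worker thread does, so it can never be collected.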
I would use jmap to obtain memory allocation histograms during normal and excursion conditions. The downside is that jmap will almost certainly provoke an execution pause while it grabs its stats.
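For what it's worth, the usual invocation is along these lines (the <pid> is the gateway JVM's process id; jmap has to run as the same account the Ignition service runs under):

    # Class histogram of everything currently on the heap (no GC forced)
    jmap -histo <pid> > histo-normal.txt

    # Count only live objects -- this forces a full GC first, so expect a pause
    jmap -histo:live <pid> > histo-live.txt

Capturing one of each during the normal regime and again during an excursion gives something concrete to diff.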
Looks like jmap is going to be a project on Windows, at least it was for this guy. Including it here for posterity. FWIW, what he's describing is exactly what we're seeing, though no resolution is posted.
When you say "let it go for a while", you did notice that the high-mem-use regime started on 8-14-2019 and ended only when we rebooted on 8-24-2019? Various clients, alarms, and designer sessions were started and ended during this period. How long is "a while" if 10 days doesn't cut it?
So, is it usual for a Java application, regardless of garbage collection algorithm and regardless of how that application is set up, to use all the memory made available to it?
If that's the case, then it should make no difference WHAT the ceiling is, since it will always be reached. Are you suggesting we could get away with assigning 700 MB (since, after a reboot, that seems to be the minimum level of memory needed to run the gateway and clients, plus 50%) and leaving it there?
Because that doesn't track with our experience. Before switching to G1GC, when memory usage would run up against the 4 GB limit we'd originally given it (we have 16 GB on the server), there would be noticeable slowdowns, components on clients would show errors, and we'd get ClockDrift errors until the garbage was collected.
For a small application, maybe. For a larger one this will result in very frequent collections, assuming it can keep up, and probably won't perform well.
Yep... the previous GC algorithm "paused the world" to clean up, and it was noticeable in the application when that happened. G1GC is much better about this. It will still pause the world, but less often and for a shorter duration.
That's the key. Different algorithms use different heuristics for triggering GC of various depths. With G1GC, the higher the % of max used, the more likely a deep-clearance GC will run (with a likely brief pause-the-world). The older CMS algorithm tended to avoid deep clearance until forced by lack of free memory, and it would produce horrendous pauses. G1GC gets a lot more done in its shallow clearances.
Java memory usage is naturally a sawtooth under all algorithms. With G1GC, I recommend, after monitoring, setting max memory so that the top of the sawtooth is between 80% and 90% of max, and the trough of the sawtooth is less than 50% of max.
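As a rough worked example of that rule, plugging in the ballpark numbers mentioned earlier in this thread (sawtooth peaking around 1 GB, roughly 0.7 GB needed right after a reboot); this is only a sketch:

    # Sizing sketch: peak of the sawtooth at 80-90% of max, trough under 50% of max.
    observed_peak_gb = 1.0      # approximate top of the sawtooth
    observed_trough_gb = 0.7    # approximate bottom of the sawtooth

    max_low = observed_peak_gb / 0.90            # ~1.1 GB
    max_high = observed_peak_gb / 0.80           # ~1.25 GB
    min_for_trough = observed_trough_gb / 0.50   # 1.4 GB lower bound from the trough rule

    # The trough constraint dominates here, so a ceiling of roughly 1.5 GB is the
    # safer pick, even though the peak then sits a little under the 80% target.
    recommended_max_gb = max(max_high, min_for_trough)  # 1.4 -> call it 1.5 GB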
So, again, after switching to G1GC, we saw the peak jump from 1 GB to 4 GB. It wasn't a slow drift; it was a very clear and discrete change between regimes.
We started with very 'hairy' garbage collection happening frequently, about once a minute, overlaid on a longer 20-hour cycle that recycled at about 1 GB. Then, all of a sudden, we shifted to a slower collection (as Kevin noticed, 10 collections in 4 hours) that allowed garbage to pile up to several gigs and was not superimposed on any slower fluctuation in memory use. 4 GB is not as much as we were using when we switched, but there's of course no reason to assume 4 GB is some natural maximum, or that 0.4 hr is some natural frequency. If it runs slower next time, we'll pile up more garbage, and in theory no ceiling is high enough.
It's that change in behavior, which came on suddenly and persisted until a reboot, that is worrying. I'd very much like to peg us at some 120% of the sawtooth peak, but if I had done so immediately after switching to G1GC, I would have pegged us at 1.2 GB, and we would have been absolutely murdered with pause-the-worlds when the algorithm went through that behavior change.
I don't think so. During the period where memory usage was peaking around 4G, it was deep-cleaning without apparent problem. A lower max memory would have made the garbage collector deep-clean more frequently, with less to do on each pass and a correspondingly shorter pause-the-world. Based on your chart, you should give Ignition about 1.5G.
Ok, so if I understand, the program using all of the memory assigned to it is only a problem if that memory is a significant amount of total system resources. Bumping up against a low ceiling should be fine.
One new fact we've discovered is that the rapid onset of higher memory use and less frequent collection is related to losing the connection to our main OPC server. We see all tags served by it marked with a red 'x'. However, on the gateway the server is not listed as 'faulted' but as permanently 'connecting'. Disabling and re-enabling the connection instantly re-establishes it, but in so doing, CPU and memory usage spike. CPU usage hits about 60% during the period when the 4 GB or so of garbage is being cleared, but hovers around 20-30% generally.
If, after restarting the connection, we also restart the gateway, CPU usage sits at around 4%.
Is any of this suggestive? I took a thread dump from the Gateway Control Utility before cycling the gateway, but I'm not sure I'm competent to parse the output.
This seems like a different problem that I'd rather troubleshoot in another thread, if not via support.
What version of Ignition are you using, and what is the "main" OPC UA server you are connected to?
It's not weird that you see a spike of CPU and memory churn when it reconnects, because there's a bunch of browsing and subscription re-establishment happening, so you've momentarily left the "steady state".
What is weird is that it doesn't reconnect on its own and you have to disable/enable it.
To be clear, this isn't just momentary churn around the reconnect: the problem persists arbitrarily long. We have not observed it go away on its own without a reboot. It's not 'normal' churn.
I've seen similar memory usage trends on at least two installations (one running 7.9.9, one at 7.9.10) that run at some baseline with the normal sawtooth trend; then that baseline jumps up to 300% or 400% of what it was, and it runs at the new baseline until the gateway server is restarted. Both are running Windows Server 2016 and are using the G1GC garbage collector.
We'd tried updating to version 8, but found several of our projects simply broke. Given that even the fonts weren't displaying correctly, we decided to wait out a few sub-versions and let the bugs get fixed before trying to sort out which issues we actually needed to address.
Also a good guess. We are indeed using G1GC on a Windows-based system. I'll have to find out whether we are using Windows Server 2016. Do you mean on the gateway computer, or on the OPC server that is correlated with starting this behavior? I know that particular server serves tags specifically from a GE PLC network we have on site; I'm not sure if it's running any particular server software besides what it comes packaged with.