I've got a gateway that our company maintains for a customer. We are trying to track down a performance issue, and I need some advice on where to start.
The gateway runs Perspective and WebDev. It is the central gateway in the customer's data center and communicates with multiple smaller remote sites over the gateway network. Some data is synced across the gateway network using gateway timer scripts.
The issue is that every so often CPU usage in Ignition spikes, memory spikes, and ClockDriftDetector warnings occur. Usually the GC is able to bring memory back down. In extreme situations, however, the Java heap seems to fill up and things start to crash.
I'm trying to isolate the cause, or more specifically prove where it isn't.
For instance, this is a cut from the performance page:
This gateway is installed on Windows Server 2016 in a Hyper-V environment. I don't have access to the Hyper-V host, so I can't verify that end. I've been told that all the resources are properly allocated and that the host isn't overcommitted.
The only software on the guest is Ignition, Windows and Carbon Black Cloud Sensor Anti-virus.
Currently, the Java heap is set to 8 GB.
Where should I start looking to get this working better? Is this a simple memory issue, or possibly something else?
Bit of an odd memory usage pattern. Typical is a sawtooth. A healthy sawtooth tends to ramp from 40%-ish (or less) to 80%-ish, then GC knocks it back down.
Except: long time spans for requested graphs, especially in the reporting module, can randomly spike memory. Two or more such requests together could push a server over the edge. Perspective also loads a server much more than Vision, so a bunch of busy clients can do the same.
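To make the sawtooth idea concrete, here is a rough Python sketch that picks GC drops out of a series of heap-usage percentages and checks whether the post-GC floor is creeping upward. The sample data, the function names, and the `min_drop` and `tolerance` thresholds are all made up for illustration and are not anything Ignition provides.

```python
def gc_cycles(samples, min_drop=10.0):
    """Return (peak, trough) pairs wherever usage falls by at least min_drop points."""
    cycles = []
    for prev, cur in zip(samples, samples[1:]):
        if prev - cur >= min_drop:
            cycles.append((prev, cur))
    return cycles


def troughs_rising(cycles, tolerance=5.0):
    """True if the post-GC floor climbs by more than tolerance from first to last cycle."""
    if len(cycles) < 2:
        return False
    return cycles[-1][1] - cycles[0][1] > tolerance


# Hypothetical heap-used percentages sampled at a fixed interval.
healthy = [40, 55, 70, 80, 42, 58, 72, 81, 41]
creeping = [40, 60, 80, 55, 70, 88, 68, 80, 95, 82]

print(gc_cycles(healthy))
print(troughs_rising(gc_cycles(healthy)))
print(troughs_rising(gc_cycles(creeping)))
```

A floor that keeps rising after each collection suggests something is holding on to memory. A single tall spike that then falls back to the old floor looks more like one of the big requests described above.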
Consider this similar topic:
Maybe also set up GC detail logging in ignition.conf. See how long your pauses usually are.
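Once a GC log exists, something like the following Python sketch can summarize the pauses. It assumes the gateway runs on Java 9 or later with unified logging (for example a `-Xlog:gc*:file=...` JVM argument added as a `wrapper.java.additional.N` entry) and that pause lines look like the hypothetical samples below. The regex, the function names, and the 1000 ms `slow_ms` threshold are illustrative assumptions, so adjust them to whatever your log actually contains.

```python
import re

# Assumed shape of a unified-logging pause line, e.g.
# "... GC(12) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(8192M) 12.345ms"
PAUSE_RE = re.compile(r"(Pause .+?) (\d+)M->(\d+)M\((\d+)M\) ([\d.]+)ms")


def parse_pauses(lines):
    """Extract pause kind, heap before/after/total (MB), and duration (ms) from log lines."""
    pauses = []
    for line in lines:
        m = PAUSE_RE.search(line)
        if m:
            pauses.append({
                "kind": m.group(1),
                "before_mb": int(m.group(2)),
                "after_mb": int(m.group(3)),
                "heap_mb": int(m.group(4)),
                "ms": float(m.group(5)),
            })
    return pauses


def summarize(pauses, slow_ms=1000.0):
    """Return the pause count, the longest pause, and every pause at or above slow_ms."""
    durations = [p["ms"] for p in pauses]
    return {
        "count": len(pauses),
        "max_ms": max(durations) if durations else 0.0,
        "slow": [p for p in pauses if p["ms"] >= slow_ms],
    }


# Hypothetical log lines, not taken from a real gateway.
sample = [
    "[12.345s][info][gc] GC(3) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(8192M) 12.5ms",
    "[98.765s][info][gc] GC(4) Pause Remark 4100M->4100M(8192M) 3.25ms",
    "[140.001s][info][gc] GC(5) Pause Full (Allocation Failure) 8000M->7000M(8192M) 4500.5ms",
    "[140.002s][info][gc] GC(5) some other line without a pause",
]

report = summarize(parse_pauses(sample))
print(report["count"], report["max_ms"])
for p in report["slow"]:
    print(p["kind"], p["ms"])
```

Long pauses that line up in time with the ClockDriftDetector warnings would point at GC. If the pauses stay short while the warnings fire, the hypervisor or something else on the guest becomes the more likely suspect.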
Anything in particular I should be looking for in the heap dumps? I finally got my local VM working correctly, and VisualVM is working. I can dump the heap and take a look locally. This isn't the same as the problem child, but at least I can get familiar with the process and with what I'm seeing.