High CPU and Memory Usage

edward.fowler · January 16, 2025, 11:16am

What's the best way to determine the cause of high CPU and memory usage?
Thank you.

Ignition v 8.1.44
Ubuntu 22.04.5 LTS

pturmel · January 16, 2025, 1:26pm

After the fact, with no prep, there's nothing to see.

I recommend running a timer event that uses Java's MX bean infrastructure to check CPU and memory usage every second, and grabbing a thread dump any time either exceeds some threshold, say ~95%.
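A minimal sketch of the per-second decision, assuming CPU and heap usage have already been read as fractions between 0.0 and 1.0. In Ignition those readings would come from the platform MX beans (for instance `getProcessCpuLoad()` on the HotSpot `com.sun.management.OperatingSystemMXBean`, and `getHeapMemoryUsage()` on the memory MX bean), and the dump itself from `system.util.threadDump()`; that wiring is left out so the logic can be shown on its own. The function name `should_capture`, the 0.95 default and the 60-second cooldown are illustrative assumptions. The cooldown keeps a sustained spike from producing a dump every second.

```python
def should_capture(cpu_load, heap_fraction, now_s, inhibit_until_s,
                   threshold=0.95, cooldown_s=60):
    """Decide whether this timer tick should grab a thread dump.

    cpu_load and heap_fraction are fractions in 0.0..1.0. A negative
    cpu_load (getProcessCpuLoad() reports one when no reading is
    available) is treated as unknown and never triggers a capture.
    Returns a tuple (capture, new_inhibit_until_s).
    """
    if now_s < inhibit_until_s:
        # Still cooling down after the previous capture.
        return False, inhibit_until_s
    cpu_hot = cpu_load >= 0 and cpu_load > threshold
    heap_hot = heap_fraction > threshold
    if cpu_hot or heap_hot:
        # Capture now and suppress further captures for cooldown_s seconds.
        return True, now_s + cooldown_s
    return False, inhibit_until_s


# Example: a spike at t=100 s fires once; the cooldown suppresses t=101 s.
inhibit = 0
fired, inhibit = should_capture(0.97, 0.40, 100, inhibit)  # (True, 160)
fired, inhibit = should_capture(0.99, 0.40, 101, inhibit)  # (False, 160)
```

In a gateway timer event, `inhibit_until_s` has to survive between runs, for example in a memory tag.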

You could also run a continuous profiler, but those tend to be fairly heavy-weight.

Edit: Here's what I use to take heap dumps in a crisis. You can use it as is, or alter it to take only thread dumps:


# Copyright 2023 Automation Professionals, LLC <sales@automation-pros.com>
#
# Redistribution and use in source and binary forms, with or without modification,
# are permitted provided that the following conditions are met:
#
#   1. Redistributions of source code must retain the above copyright notice,
#      this list of conditions and the following disclaimer.
#   2. Redistributions in binary form must reproduce the above copyright notice,
#      this list of conditions and the following disclaimer in the documentation
#      and/or other materials provided with the distribution.
#   3. The name of the author may not be used to endorse or promote products
#      derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED
# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
# SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
# OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
# IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
# OF SUCH DAMAGE.

from java.lang.management import ManagementFactory, MemoryMXBean
from com.sun.management import HotSpotDiagnosticMXBean

logger = system.util.getLogger(system.util.getProjectName() + '.' + __name__)

mbServer = ManagementFactory.getPlatformMBeanServer()

diagMxBean = ManagementFactory.newPlatformMXBeanProxy(mbServer, "com.sun.management:type=HotSpotDiagnostic", HotSpotDiagnosticMXBean)

memMxBean = ManagementFactory.getMemoryMXBean()

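def heapUsedNow():
	# Fraction (0.0 to 1.0) of the maximum heap currently in use.
	# usage.max is -1 when no maximum is defined, so report 0.0 then.
	usage = memMxBean.getHeapMemoryUsage()
	if usage.max > 0:
		return 1.0 * usage.used / usage.max
	return 0.0
#
#
# Tags: a DateTime before which no new dump may be taken, and the
# trigger threshold as a fraction (e.g. 0.95, not 95).
priorDumpTagPath = '[default]Diagnostics/HeapDumpInhibitTS'
dumpThresholdTagPath = '[default]Diagnostics/HeapDumpThreshold'
def checkMemForDump():
	# Call from a gateway timer event only while actively chasing a problem.
	now = system.date.now()
	inhibitTS, threshold = [x.value for x in system.tag.readBlocking([priorDumpTagPath, dumpThresholdTagPath])]
	if inhibitTS is None or now.after(inhibitTS):
		# dump is allowed (an empty inhibit tag also allows it)
		if heapUsedNow() > threshold:
			# Inhibit further dumps for 30 seconds before starting the slow dump.
			system.tag.writeBlocking([priorDumpTagPath], [system.date.addSeconds(now, 30)])
			fnBase = '/usr/share/ignition/diag_dump_' + system.date.format(now, 'yyyyMMdd_HHmmss')
			# live=False: dump all objects, not only reachable ones.
			diagMxBean.dumpHeap(fnBase + '.hprof', False)
			# Save a thread dump alongside the heap dump.
			system.file.writeFile(fnBase + '.threads', system.util.threadDump())
#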

#
# kate: tab-width 4; indent-width 4; tab-indents on; dynamic-word-wrap off; indent-mode python; line-numbers on;
#
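Because `heapUsedNow()` returns a fraction (used divided by max), the `HeapDumpThreshold` tag has to hold a value such as 0.95; entering 95 would mean the dump never fires. A small hypothetical helper like the one below tolerates either form. The name `normalize_threshold` and the rule that values above 1 are percentages are assumptions, not part of the original script.

```python
def normalize_threshold(value):
    """Return a threshold as a fraction in 0.0..1.0.

    Hypothetical helper: values greater than 1 are assumed to be
    percentages (95 -> 0.95); values at or below 1 are used as-is.
    """
    value = float(value)
    if value > 1.0:
        value = value / 100.0
    # Clamp into the valid 0.0..1.0 range.
    return min(max(value, 0.0), 1.0)
```

Inside the check, the comparison would then read `heapUsedNow() > normalize_threshold(threshold)`.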

Steve_Laubach · January 16, 2025, 2:32pm

That will likely come in handy down the road. Thanks Phil!

pturmel · January 16, 2025, 3:18pm

I should mention that taking a heap dump is itself extremely heavyweight, and produces a JVM stall all by itself. Don't enable the timer event that calls that check function unless you are actively tracking a problem. Think of it as exploratory surgery.
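One way to make that exploratory surgery harder to leave running by accident is to arm the check only for a bounded window. This is a sketch under assumptions: `armed_until_s` is a hypothetical expiry timestamp (it could live in another tag next to `HeapDumpInhibitTS`), and the check does nothing once that window has passed.

```python
def check_allowed(now_s, armed_until_s, inhibit_until_s):
    """Return True only while the check is armed and not cooling down.

    armed_until_s: hypothetical expiry time set by whoever enables the
    diagnostics; None means never armed. Once now_s reaches it, the
    check disarms itself.
    inhibit_until_s: the earliest moment another dump may be taken,
    playing the role of HeapDumpInhibitTS in the script above.
    """
    if armed_until_s is None or now_s >= armed_until_s:
        return False  # never armed, or the arming window has expired
    # Mirrors now.after(inhibitTS): strictly later than the inhibit time.
    return now_s > inhibit_until_s
```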

Steve_Laubach · January 16, 2025, 3:39pm

Noted. Thanks again.

I saved a link to this post in case someone calls me in to troubleshoot this kind of problem. It happens from time to time.

avaughn · January 16, 2025, 3:59pm

Starting with version 8.1.13, automatic thread dumps can also be enabled under Gateway Settings.

pturmel · January 16, 2025, 4:06pm

Those help for pure performance problems, but performance problems accompanying memory exhaustion are nigh-impossible to figure out without a heap dump.

avaughn · January 16, 2025, 4:09pm

Agreed! I probably should have clarified that it would only help troubleshoot high CPU usage.
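The distinction drawn in these replies (thread dumps for CPU problems, heap dumps once memory is involved) can be encoded in the trigger itself. A minimal sketch, assuming separate CPU and heap thresholds expressed as fractions; the function name `dumps_to_take` and the default values are illustrative. A thread dump, comparatively cheap, is included whenever either reading is high, while the heavyweight heap dump is reserved for memory pressure.

```python
def dumps_to_take(cpu_load, heap_fraction, cpu_threshold=0.95, heap_threshold=0.90):
    """Return the set of dump kinds worth capturing for these readings."""
    kinds = set()
    if cpu_load > cpu_threshold or heap_fraction > heap_threshold:
        # Thread dumps are comparatively cheap and show what every thread is doing.
        kinds.add("threads")
    if heap_fraction > heap_threshold:
        # Heap dumps stall the JVM, so only take one under memory pressure.
        kinds.add("heap")
    return kinds
```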

PGriffith · January 16, 2025, 4:28pm

I'll also point out for posterity that if you set a flag prior to launching the gateway, you can capture heap dumps in a Diagnostics Bundle:

It doesn't help diagnose an immediate problem, but it can be useful on a known-problematic system.

Chris_Bingham · January 16, 2025, 4:40pm

Interesting. Thanks for sharing!

I'm curious how this method differs from the [System]Gateway/Performance/CPU Usage (& similar) tags and, if the difference is negligible, whether you would recommend the Gateway Timer Script over a similar Gateway Tag Change script monitoring those tags?

edward.fowler · January 16, 2025, 4:40pm

Thank you @pturmel , much appreciated.

pturmel · January 16, 2025, 4:54pm

The MX bean is where the tag gets its value, so going through the tag subsystem only adds overhead. I'd not use a tag event for this. Just use a timer.

pturmel · January 16, 2025, 4:56pm

Noted. My script, by contrast, can be deployed without a gateway restart, and restarts can be a problem for production sites.