What's the best way to determine the cause of high CPU and memory usage?
Thank you.
Ignition v 8.1.44
Ubuntu 22.04.5 LTS
After the fact, with no prep, there's nothing to see.
I recommend running a timer event that uses Java's MXBean infrastructure to check CPU and memory usage every second, and grabs a thread dump any time either exceeds some threshold, say ~95%.
You could also run a continuous profiler, but those tend to be fairly heavy-weight.
Edit: Here's what I use to take heap dumps in crises. You can use as is, or perhaps alter to only do thread dumps:
# Copyright 2023 Automation Professionals, LLC <sales@automation-pros.com>
#
# Redistribution and use in source and binary forms, with or without modification,
# are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
# 3. The name of the author may not be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED
# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
# SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
# OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
# IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
# OF SUCH DAMAGE.
from java.lang.management import ManagementFactory, MemoryMXBean
from com.sun.management import HotSpotDiagnosticMXBean
logger = system.util.getLogger(system.util.getProjectName() + '.' + __name__)
mbServer = ManagementFactory.getPlatformMBeanServer()
diagMxBean = ManagementFactory.newPlatformMXBeanProxy(mbServer, "com.sun.management:type=HotSpotDiagnostic", HotSpotDiagnosticMXBean)
memMxBean = ManagementFactory.getMemoryMXBean()
def heapUsedNow():
    usage = memMxBean.getHeapMemoryUsage()
    if usage.max > 0:
        return 1.0 * usage.used / usage.max
    return 0.0
#
#
priorDumpTagPath = '[default]Diagnostics/HeapDumpInhibitTS'
dumpThresholdTagPath = '[default]Diagnostics/HeapDumpThreshold'
def checkMemForDump():
    now = system.date.now()
    inhibitTS, threshold = [x.value for x in system.tag.readBlocking([priorDumpTagPath, dumpThresholdTagPath])]
    if now.after(inhibitTS):
        # dump is allowed
        if heapUsedNow() > threshold:
            system.tag.writeBlocking([priorDumpTagPath], [system.date.addSeconds(now, 30)])
            fnBase = '/usr/share/ignition/diag_dump_' + system.date.format(now, 'yyyyMMdd_HHmmss')
            diagMxBean.dumpHeap(fnBase + '.hprof', False)
            system.file.writeFile(fnBase + '.threads', system.util.threadDump())
#
#
# kate: tab-width 4; indent-width 4; tab-indents on; dynamic-word-wrap off; indent-mode python; line-numbers on;
#
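For anyone who only wants thread dumps, here is a rough sketch of how the check above might be trimmed down and extended to watch CPU as well. It is not the script from this post: `CPU_LIMIT`, `HEAP_LIMIT`, `HOLDOFF_MS`, `shouldDump`, `readJvmLoad`, and `checkForThreadDump` are hypothetical names, the 95% limits and 30-second hold-off are just starting points, and it assumes the code lives in a gateway-scoped project library script (where `system` is available) on a HotSpot JVM.

```python
# Sketch only: every name defined here is hypothetical, not part of Ignition.
CPU_LIMIT = 0.95     # fraction of CPU used by the gateway JVM process
HEAP_LIMIT = 0.95    # fraction of max heap currently in use
HOLDOFF_MS = 30000   # minimum spacing between dumps, in milliseconds

_state = {'nextAllowedMs': 0, 'osBean': None}

def shouldDump(cpu, heap, nowMs, nextAllowedMs, cpuLimit=CPU_LIMIT, heapLimit=HEAP_LIMIT):
    """Pure decision: True when the hold-off has expired and either reading is over its limit."""
    if nowMs < nextAllowedMs:
        return False
    return cpu > cpuLimit or heap > heapLimit

def readJvmLoad():
    """Return (cpuFraction, heapFraction) from the platform MXBeans. Needs a JVM (Jython)."""
    from java.lang.management import ManagementFactory
    from com.sun.management import OperatingSystemMXBean
    if _state['osBean'] is None:
        # Same proxy approach as the heap dump script, pointed at the OS bean.
        _state['osBean'] = ManagementFactory.newPlatformMXBeanProxy(
            ManagementFactory.getPlatformMBeanServer(),
            ManagementFactory.OPERATING_SYSTEM_MXBEAN_NAME,
            OperatingSystemMXBean)
    usage = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage()
    heap = 1.0 * usage.used / usage.max if usage.max > 0 else 0.0
    # getProcessCpuLoad() reports a negative value when the load is unavailable.
    cpu = max(0.0, _state['osBean'].getProcessCpuLoad())
    return cpu, heap

def checkForThreadDump():
    """Intended to be called from a gateway timer script, e.g. once per second."""
    now = system.date.now()
    nowMs = system.date.toMillis(now)
    cpu, heap = readJvmLoad()
    if shouldDump(cpu, heap, nowMs, _state['nextAllowedMs']):
        _state['nextAllowedMs'] = nowMs + HOLDOFF_MS
        fn = '/usr/share/ignition/diag_threads_' + system.date.format(now, 'yyyyMMdd_HHmmss') + '.threads'
        system.file.writeFile(fn, system.util.threadDump())
```

The hold-off here is kept in a module-level dictionary, so it resets whenever the scripting environment restarts; keeping it in a tag, as the heap dump script does, survives that and lets you adjust the threshold without editing the script.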
That will likely come in handy down the road. Thanks Phil!
I should mention that taking a heap dump is itself extremely heavyweight, and produces a JVM stall all by itself. Don't enable the timer event that calls that check function unless you are actively tracking a problem. Think of it as exploratory surgery.
Noted. Thanks again.
I saved a link to this post in case someone calls me in to troubleshoot this kind of problem. It happens from time to time.
Starting with version 8.1.13, automatic thread dumps can also be enabled under Gateway Settings.
Those help for pure performance problems, but performance problems accompanying memory exhaustion are nigh-impossible to figure out without a heap dump.
Agreed! I probably should have clarified that it would only help troubleshoot high CPU usage.
I'll also point out for posterity that if you set a flag prior to launching the gateway, you can capture heap dumps in a Diagnostics Bundle:
Doesn't help with an immediate problem diagnosis, but can be a useful thing with a known-problematic system.
Interesting. Thanks for sharing!
I'm curious how this method differs from the [System]Gateway/Performance/CPU Usage (and similar) tags and, if the difference is negligible, whether you would recommend the Gateway Timer Script over a similar Gateway Tag Change script monitoring those tags?
Thank you @pturmel, much appreciated.
The MX bean is where the tag gets its value. So there's more overhead in the tag subsystem. I'd not use a tag event for this. Just use a timer.
Noted. My script can be deployed without a gateway restart; a restart can be a problem for production sites.