I just tried to run system.opc.browse with a folder filter, no opcItemPath filter, and no wildcards, and at about the same time I noticed the gateway CPU spike from 12%; it has been at roughly 80% for about 1.5 hours now.
I took a flight recording and I'm seeing 131,000 of these exceptions within the 5-minute recording I took: io.netty.util.ResourceLeakDetector$TraceRecord
flight recordings (3 × 5 min [~160 MB] and 2 × 1 hr [1-1.5 GB]), set to record all exceptions, errors and otherwise
wrapper log
lots and lots of thread dumps via the system.util.threadDump() scripting call (I set up auto-dumping and emailing if the CPU gets above 65%; rough sketch of the timer script below)
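For reference, the auto-dump is just a gateway timer script along these lines. This is a minimal sketch, not the exact script: the CPU system tag path, the SMTP profile name, and the email addresses are placeholders to adapt, and it's worth checking whether your CPU tag reports 0-100 or 0-1.

```python
# Gateway timer script (sketch). Reads the gateway CPU system tag and, when it
# crosses a threshold, emails a thread dump taken with system.util.threadDump().
# The tag path, SMTP profile and addresses below are placeholders.
CPU_TAG = "[System]Gateway/Performance/CPU Usage"
THRESHOLD = 65.0  # percent

cpu = system.tag.readBlocking([CPU_TAG])[0].value
if cpu is not None and cpu > THRESHOLD:
    dump = system.util.threadDump()  # returns the dump as a string (8.x)
    system.net.sendEmail(
        smtpProfile="default",             # placeholder SMTP profile name
        fromAddr="gateway@example.com",
        subject="Gateway CPU above %s%% - thread dump" % THRESHOLD,
        body=dump,
        to=["ops@example.com"],
    )
```

In practice it's also worth rate-limiting the email, since a sustained spike will otherwise send one message per timer execution.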
In the wrapper log you would see errors like this:
ERROR i.n.u.ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. Enable advanced leak reporting to find out where the leak occurred. To enable advanced leak reporting, specify the JVM option '-Dio.netty.leakDetectionLevel=advanced' or call ResourceLeakDetector.setLevel()
If you don't see those then all you are observing is the Netty ByteBuf leak detection machinery in action.
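If you do want to chase a suspected leak further, the advanced leak reporting the message mentions is just a JVM argument. On a standard install you would add it to ignition.conf (using the next free wrapper.java.additional index in place of N) and restart the gateway:

```
wrapper.java.additional.N=-Dio.netty.leakDetectionLevel=advanced
```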
system.opc.browse is incredibly dangerous because it does a fully recursive browse that can't be terminated. I always recommend against using it; use its replacement, system.opc.browseServer, which requires that you drive any recursion yourself with subsequent calls.
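Driving the recursion yourself looks roughly like the sketch below. Treat it as an outline rather than a drop-in script: the OPC connection name is a placeholder, and the element accessors (getDisplayName/getElementType/getServerNodeId) should be checked against the manual for your version.

```python
# Depth-limited browse using system.opc.browseServer (Jython sketch).
# "Ignition OPC UA Server" is a placeholder connection name.
def browse(server, nodeId=None, depth=0, maxDepth=3):
    # The depth cap is the point: you decide how far the browse goes instead
    # of letting it walk the entire address space in one call.
    if depth > maxDepth:
        return
    for element in system.opc.browseServer(server, nodeId):
        print "  " * depth + str(element.getDisplayName())
        # Normally you would also filter on element.getElementType() here so
        # leaf/tag nodes are not browsed any further.
        browse(server, element.getServerNodeId(), depth + 1, maxDepth)

browse("Ignition OPC UA Server")  # nodeId=None should start at the server root
```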
INFO | jvm 1 | 2021/05/21 23:41:55 | 23:41:54,993 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating DEBUG level on Logger[io.netty.util.ResourceLeakDetector] onto the JUL framework
INFO | jvm 1 | 2021/05/21 23:42:16 | 23:42:16,405 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating TRACE level on Logger[io.netty.util.ResourceLeakDetector] onto the JUL framework
INFO | jvm 1 | 2021/05/21 23:42:37 | 23:42:37,633 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating INFO level on Logger[Scripts-shared] onto the JUL framework
INFO | jvm 1 | 2021/05/21 23:42:37 | 23:42:37,633 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating INFO level on Logger[Scripts-shared.alarms] onto the JUL framework
INFO | jvm 1 | 2021/05/21 23:42:37 | 23:42:37,633 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating INFO level on Logger[Scripts-shared.winedelivery] onto the JUL framework
INFO | jvm 1 | 2021/05/21 23:42:37 | 23:42:37,633 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating INFO level on Logger[Scripts-shared.winedelivery.cip] onto the JUL framework
INFO | jvm 1 | 2021/05/21 23:42:37 | 23:42:37,633 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating INFO level on Logger[drivers.MicroLogixDriver.ABMicroLogixWriteRequest] onto the JUL framework
INFO | jvm 1 | 2021/05/21 23:42:37 | 23:42:37,633 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating INFO level on Logger[io.netty.util.ResourceLeakDetector] onto the JUL framework
INFO | jvm 1 | 2021/05/21 23:42:37 | 23:42:37,633 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating INFO level on Logger[org.apache.wicket.markup.html.form.PasswordTextField] onto the JUL framework
INFO | jvm 1 | 2021/05/21 23:42:37 | 23:42:37,633 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating INFO level on Logger[perspective.binding.ExpressionStructure] onto the JUL framework
Might be good to have a warning in the user manual for dummies like me
We have 72 connected devices, which might explain the holdup...
Memory is still up around 83%, up from a previous max of 55%... waiting for the garbage collector to kick in... any time now
It's entirely possible the UA module does not restart without leaking memory. A few years ago I reported some kind of ClassLoader leak to Netty, but I don't know if they ever fixed it. I think it was a very small amount of memory leaked on each module restart, though, so it might not be relevant here.
Definitely looks like it has leaked memory. I checked the graphs after the weekend (the issue occurred on Friday night - see the section of the graph about two-fifths of the way along, where the CPU spiked) and the new memory baseline on the graph has moved up: