[BUG] ResourceLeakDetector Exceptions after running system.opc.browse

v8.1.5

I just tried to run system.opc.browse with a folder filter, no opcItemPath filter, and no wildcards, and at about the same time I noticed the gateway CPU spike from 12%; it has been at 80% or thereabouts for about 1.5 hours now.
I took a flight recording and I'm seeing 131,000 of these exceptions within the 5-minute recording I took:
io.netty.util.ResourceLeakDetector$TraceRecord

This is the associated stack trace:

I've already contacted support, but their hours haven't started yet… Any idea how to fix this without a restart? Maybe a restart of the OPC UA module?

Edit: A restart of the OPC UA module seems to have fixed it… I think I found a bug / memory leak.

Do you have logs with exception stack traces that indicate a leak?

When you say ‘logs’, what in particular?
I have:

  • flight recordings (3 x 5 min [~160 MB] and 2 x 1 hr [1-1.5 GB]), configured to record all exceptions, errors, and other events
  • wrapper log
  • lots and lots of thread dumps via the system.util.threadDump scripting call (I set up auto-dumping and emailing whenever the CPU gets above 65%; a rough sketch follows this list)
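(For reference, a rough sketch of that auto-dump script, assuming a gateway timer event, an SMTP profile named 'mailserver', and the addresses shown, all of which are hypothetical; getProcessCpuLoad() is the JVM's own com.sun.management.OperatingSystemMXBean method, which HotSpot exposes.)

    # Hypothetical gateway timer script: dump threads and email the dump
    # whenever the gateway JVM's CPU load goes above 65%.
    from java.lang.management import ManagementFactory

    load = ManagementFactory.getOperatingSystemMXBean().getProcessCpuLoad()
    if load > 0.65:
        dump = system.util.threadDump()  # the full dump, as a string
        system.net.sendEmail(
            smtpProfile='mailserver',        # hypothetical SMTP profile
            fromAddr='gateway@example.com',  # hypothetical addresses
            subject='Gateway CPU at %.0f%%' % (load * 100),
            body=dump,
            to=['ops@example.com'])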

In the wrapper log you would see errors like this:

ERROR i.n.u.ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. Enable advanced leak reporting to find out where the leak occurred. To enable advanced leak reporting, specify the JVM option '-Dio.netty.leakDetectionLevel=advanced' or call ResourceLeakDetector.setLevel()
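(In Ignition, the usual way to pass that JVM option is a line in data/ignition.conf under "Java Additional Parameters"; the .10 index below is a placeholder, so use the next unused wrapper.java.additional slot in your file.)

    # data/ignition.conf -- Java Additional Parameters section
    wrapper.java.additional.10=-Dio.netty.leakDetectionLevel=advanced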

If you don't see those, then all you are observing is the Netty ByteBuf leak-detection machinery in action.

system.opc.browse is incredibly dangerous because it does a fully recursive browse that can't be terminated. I always recommend never using it and instead using its replacement, system.opc.browseServer, which requires that you drive any recursion yourself with subsequent calls.
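For anyone who wants the safer pattern, here is a minimal sketch of driving the recursion yourself with system.opc.browseServer, with an explicit depth cap so the walk can't run away. The OPCBrowseElement accessors used here come from the scripting docs; the element-type names checked and the server connection name are assumptions, so adjust them for your system.

    # Sketch: bounded recursive browse, one level per browseServer call.
    def browseLevel(serverName, nodeId=None, depth=0, maxDepth=3):
        found = []
        if depth > maxDepth:
            return found
        # Omitting nodeId browses from the server's root.
        if nodeId is None:
            children = system.opc.browseServer(serverName)
        else:
            children = system.opc.browseServer(serverName, nodeId)
        for element in children:
            found.append(element.getDisplayName())
            # Only descend into container nodes; the type names here
            # (FOLDER, OBJECT) are assumptions -- check your server's output.
            if str(element.getElementType()) in ('FOLDER', 'OBJECT'):
                found += browseLevel(serverName, element.getServerNodeId(),
                                     depth + 1, maxDepth)
        return found

    # 'Ignition OPC UA Server' is the default connection name; substitute yours.
    print browseLevel('Ignition OPC UA Server')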

INFO   | jvm 1    | 2021/05/21 23:41:55 | 23:41:54,993 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating DEBUG level on Logger[io.netty.util.ResourceLeakDetector] onto the JUL framework
INFO   | jvm 1    | 2021/05/21 23:42:16 | 23:42:16,405 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating TRACE level on Logger[io.netty.util.ResourceLeakDetector] onto the JUL framework
INFO   | jvm 1    | 2021/05/21 23:42:37 | 23:42:37,633 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating INFO level on Logger[Scripts-shared] onto the JUL framework
INFO   | jvm 1    | 2021/05/21 23:42:37 | 23:42:37,633 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating INFO level on Logger[Scripts-shared.alarms] onto the JUL framework
INFO   | jvm 1    | 2021/05/21 23:42:37 | 23:42:37,633 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating INFO level on Logger[Scripts-shared.winedelivery] onto the JUL framework
INFO   | jvm 1    | 2021/05/21 23:42:37 | 23:42:37,633 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating INFO level on Logger[Scripts-shared.winedelivery.cip] onto the JUL framework
INFO   | jvm 1    | 2021/05/21 23:42:37 | 23:42:37,633 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating INFO level on Logger[drivers.MicroLogixDriver.ABMicroLogixWriteRequest] onto the JUL framework
INFO   | jvm 1    | 2021/05/21 23:42:37 | 23:42:37,633 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating INFO level on Logger[io.netty.util.ResourceLeakDetector] onto the JUL framework
INFO   | jvm 1    | 2021/05/21 23:42:37 | 23:42:37,633 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating INFO level on Logger[org.apache.wicket.markup.html.form.PasswordTextField] onto the JUL framework
INFO   | jvm 1    | 2021/05/21 23:42:37 | 23:42:37,633 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@6ff79f8d - Propagating INFO level on Logger[perspective.binding.ExpressionStructure] onto the JUL framework

Might be good to have a warning in the user manual for dummies like me :upside_down_face:
We have 72 connected devices, which might explain the holdup…

Memory is still up around 83%, up from a previous max of 55%… waiting for the garbage collector to kick in… any time now.
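(If you want to nudge it from the gateway script console, you can request a full collection, though System.gc() is only a hint to the JVM, and memory held by pooled ByteBufs that were never release()d won't come back to the pool either way.)

    # Script console: request a full GC (a hint, not a command).
    from java.lang import System
    System.gc()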

It's entirely possible the UA module does not restart without leaking memory. A few years ago I reported some kind of ClassLoader leak to Netty, but I don't know if they ever fixed it. I think it was a very small amount of memory leaked on each module restart, though, so it might not be relevant here.

Definitely looks like it's leaked memory. I checked the graphs after the weekend (the issue occurred on Friday night; see the section of the graph about 2/5 of the way along, where the CPU spiked) and the memory graph has moved up:

To fix it, I guess I'll need to do a gateway reboot?