Memory Leak / Heap dump

Hello,

I am running into a memory leak. I'm in contact with support, and I've given them a heap dump taken about 10 minutes after the application started.
I want to take another heap dump, but when I do (and the heap is > 5GB), the heap dump fails and the application restarts.
Here is the command I am using:

I always get the following result:

Has anyone experienced this before? I'm sure we'll figure something out with support, but because of the time difference I'd like to see if anyone else has a suggestion.

Thanks in advance!

Do you have any gateway timer scripts that run every ten minutes or less? Does this happen when the gateway is running stand-alone, without any clients connected? Just to make sure this isn't something influenced by clients interacting with the gateway.

We do have some gateway scripts running, and they run every 10 minutes or less. The gateway also runs with clients connected. Could this be the reason it fails when the heap is too high?
Just before I take the heap dump, I switch this gateway with the back-up gateway, so I'm not sure whether the gateway scripts are still running then. All clients should also be transferred to the back-up gateway.

What sort of functionality do these gateway timer scripts perform? Some culprits I've seen that can lead to issues:

  • you might have a system.db.runQuery("SELECT * FROM table") without a LIMIT clause on a table that has millions of rows
  • your script might take so long that the next instance starts before the current one finishes
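The second point can be guarded against with a non-blocking lock, so a slow run makes the next timer tick a no-op instead of a second concurrent copy of the same work. A sketch of the pattern in plain Python (a `threading.Lock` stands in for whatever shared state your gateway scripting environment gives you, e.g. something kept in `system.util.getGlobals()`):

```python
# Sketch of an overlap guard for a timer script: if the previous run
# has not finished yet, skip this run instead of piling up.
import threading

_running = threading.Lock()

def timer_script(do_work):
    """Run do_work unless a previous run is still in progress.

    Returns True if the work ran, False if it was skipped.
    """
    if not _running.acquire(False):  # non-blocking try-lock
        return False                 # previous run still busy: skip
    try:
        do_work()
        return True
    finally:
        _running.release()
```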

The clients - do they interact with the gateway much? Do they use system.util.sendMessage to tell the gateway to do things, either on a schedule or via user interaction? You can check your gateway message handlers to see if that's possible, and then look for what I just mentioned in those handlers.
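When auditing those handlers, it helps to think of them as a name-to-function table that clients can trigger at will. A toy stand-in for that dispatch pattern (plain Python; in Ignition the registration and delivery are done for you by the gateway and `system.util.sendMessage`, and the handler name below is made up):

```python
# Toy stand-in for gateway message handlers: a dict mapping handler
# names to functions, like the handlers clients trigger remotely.
# Auditing this table shows which client-triggered work is heavy.

handlers = {}

def handler(name):
    """Register a function under a message-handler name."""
    def register(fn):
        handlers[name] = fn
        return fn
    return register

@handler("refreshCache")  # hypothetical handler name
def refresh_cache(payload):
    # If this ran an unbounded SELECT, every client message would
    # drag a full table onto the gateway heap.
    return "refreshed " + payload["table"]

def dispatch(name, payload):
    """Roughly what the gateway does when a message arrives."""
    return handlers[name](payload)

print(dispatch("refreshCache", {"table": "recipes"}))
```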

Are you able to take a backup of the gateway, put it on a development machine, and run it without any clients, to see if it still crashes after 10 or so minutes? That would at least help narrow down what could be causing it - you'd know it has nothing to do with client interactions.

Also, I don't know why I forgot to mention this, but do you see anything in your server logs? Look at the minute or two right before the crash.

We do everything you mentioned. We use system.db.runQuery without a LIMIT clause. It doesn't happen often, but it is possible.
And yes, we use system.util.sendMessage. It has caused serious problems for us in the past, but that's fixed now. I'll have to wait a few more days for the next crash, but I'll keep an eye out.
With all that said, it's still unclear why the heap dump would fail. I just took another one where the heap was bigger than before, with no problems so far. Only when it gets too high does the heap dump fail.

Yeah, I would try doing your system.db.runQuery in batches of, say, 5000 rows at a time instead of all at once, and if that doesn't fix your problem, definitely call IA.
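The batching idea, sketched in plain Python. `run_query` is a stub standing in for `system.db.runQuery` (it pages an in-memory list), so only the LIMIT/OFFSET paging logic is the point; the table and column names are made up:

```python
# Batched reads: page through a big table 5000 rows at a time instead
# of one unbounded SELECT that materialises every row on the heap.

TABLE = list(range(23000))  # stand-in for a large database table

def run_query(sql, params):
    # Stub for system.db.runQuery, emulating "... LIMIT ? OFFSET ?".
    limit, offset = params
    return TABLE[offset:offset + limit]

def process_in_batches(batch_size=5000):
    offset = 0
    total = 0
    while True:
        batch = run_query(
            "SELECT id FROM big_table ORDER BY id LIMIT ? OFFSET ?",
            [batch_size, offset])
        if not batch:
            break
        total += len(batch)  # process and discard; the heap stays flat
        offset += batch_size
    return total

print(process_in_batches())  # → 23000
```

Note the ORDER BY: without a stable ordering, LIMIT/OFFSET pages can overlap or skip rows between batches.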

Do the heap dumps fail immediately or hang for, say, 30 seconds and then fail?

They fail after some time; 30 seconds sounds about right.

Try adding this to your ignition.conf file and restarting:

wrapper.ping.timeout=0
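For context (my understanding of why this helps, not something I've confirmed in the docs): Ignition runs under the Java Service Wrapper, which pings the JVM on a schedule and restarts it if a ping goes unanswered for the timeout period, 30 seconds by default - which matches the ~30-second failures described above. Taking a heap dump of a 5GB heap can pause the JVM longer than that, so the wrapper assumes it has hung and restarts it mid-dump. Setting the timeout to 0 disables the check:

```
# ignition.conf (Java Service Wrapper section)
# 0 disables the ping timeout entirely. Consider restoring the default
# once the heap dumps are collected, so a genuinely hung JVM still
# gets restarted automatically.
wrapper.ping.timeout=0
```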

I've just added wrapper.ping.timeout=0 to the conf file and restarted the system.
I'll try again tomorrow; the application usually reaches approximately 5GB after about 8 hours.
Let's hope it works 🙂
Thanks for the advice; I'll keep you posted.

BR

So I've added the wrapper.ping setting and tried to take a heap dump.
Looks like it worked! Thank you very much.
This has been a pain for quite some time now. I'll hand the dump over to support. Hopefully we can find the leak quickly now 🙂

BR,
Niels