Using jmap to diagnose heap related issues in Ignition?

I’ve been searching for the cause of heap consumption for a while now. Switched to G1GC garbage collection at the suggestion of some people. Used the “del” operation on some objects to help with garbage collection and Ignition is still trending toward consuming available heap over time. So I’ve seen jmap mentioned numerous times for diagnosing heap-related issues. I now have the JDK installed on my server but no matter what PID I target with jmap I get an error. How do you determine what the target PID is when running Ignition? Does jmap have to “Run As Administrator”? I’ve seen notes that jps would help identify the items of interest but I get nothing from that either. Maybe the same “Run As Administrator” issue?

Can you expand on that a bit? Over what timespan? What’s your max memory set to? If memory usage is a “sawtooth” pattern - that’s totally normal, and expected. If the trough of those sawtooths is getting higher over time, that could be a memory leak, but even then, unless you’re actually running out of memory, you may be spending unnecessary effort to diagnose perfectly normal behavior.
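To make the sawtooth point concrete, here is a rough sketch of checking whether the troughs are actually climbing. It assumes you have exported heap-usage samples (in MB) from somewhere like a performance trend; the function names and the sample numbers are made up for illustration.

```python
def find_troughs(samples):
    """Return the local minima of a heap-usage series (the bottom of each sawtooth)."""
    troughs = []
    for i in range(1, len(samples) - 1):
        if samples[i] < samples[i - 1] and samples[i] <= samples[i + 1]:
            troughs.append(samples[i])
    return troughs


def troughs_rising(samples, tolerance_mb=0):
    """True when every post-collection low point is higher than the one before it."""
    troughs = find_troughs(samples)
    if len(troughs) < 2:
        return False  # not enough collections observed to judge
    return all(b > a + tolerance_mb for a, b in zip(troughs, troughs[1:]))


# Hypothetical data: a normal sawtooth, and one whose floor keeps rising.
healthy = [500, 700, 900, 510, 720, 910, 505, 700, 900, 500, 650]
leaking = [500, 700, 900, 550, 750, 950, 620, 800, 990, 700, 850]

print(troughs_rising(healthy))
print(troughs_rising(leaking))
```

A rising floor over a few collections is only a hint, not proof of a leak; the longer the window you look at, the more it means.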

I put the G1GC garbage collection into effect last Tuesday the 11th. Since that time I’ve seen the heap usage rise from the initial .5 GB level to just over 1 GB level. The server is currently configured with a maximum of 2 GB. Prior to the G1GC garbage collection I would see the heap depleted in about 2 weeks resulting in a restart. This appears to be headed in the same direction but it may take longer to reach the critical point. The system has not yet started seeing the “Pause-the-world” events that were typically seen toward the point of no return.

Okay, so if you do have a memory leak, it’s more likely a bug within Ignition than anything you’re doing (or something you can do anything about) so I’d recommend getting in contact with support directly. You may want to see if you really do encounter OOM errors in the first place, though. Also, what version of Ignition are you using?

I’m assuming that the OOM condition will result in the restart that we do see. I’ve not dug through the logs far enough to actually see a declaration of OOM though.

Actually Paul, I did get into the log files yesterday. I never see the OOM message. What does happen is the ClockDriftDetection begins to report longer and longer deviations, 5000 to 6000 ms, with one extending to 20+ seconds. The connection between the primary and backup servers is deemed to be down. There appears to be a brief failover and a restart on the original primary, and then it fails back to the primary and everything recovers and we start the process all over again. I’m assuming the drift deviations are a direct result of the heap issue and efforts to recover it. I did open a ticket with support yesterday afternoon but haven’t seen a response yet. Thanks for your help on this.

It does sound like your application has a memory leak. I’d work on getting jps and jmap working while you wait for support to get back to you. If you emailed it could be a day or two depending on the call volume.

Well, that’s kind of where I started. I have JDK on a server now but I’m not having any luck getting any information. I’m not familiar with the use of these tools and how to direct them at the appropriate Ignition components to gather the information I need.

Try running them from a console started as administrator. I don’t know why else they wouldn’t work and I’m not really familiar enough with Windows to have any other guesses.

FWIW, when you run jps Ignition may show up as “WrapperSimpleApp”.

When I initiate jps from the command line on the Windows Server what I get appears to be the pid of jps and nothing else.


Hmm. Probably something to do with Windows and Ignition running as a service :confused:

I wonder if you’d have better luck with a tool like VisualVM

This thread suggests some potential workarounds… I guess it is a bit more difficult on Windows.


I went and looked at the suggested thread. Seems like an entire project just to get Java utilities to run under Windows. Thanks for the suggestion though. Haven’t gone down the path of VisualVM. Not sure what it is and what is required to install it. The target system is in production and will not tolerate a restart so if a system restart is required I can’t go down that path.

Another issue I might be encountering is that my JDK is not the identical version of the runtime environment version. It’s actually a bit higher version.

I wanted to get back to you on your jmap suggestion. I did finally get a heap dump using jmap. It just took some experimentation. I was able to use netstat -anon to locate the PID I needed to give jmap. The key is that you need your Windows cmd window to be running as administrator. I used the command jmap -dump:file=C:\your path here\heapdump.bin - to generate the dump file, which takes a lot of disk space. Basically the same size as your heap. Using jhat to analyze the file also took some experimentation. It turns out that jhat requires a HUGE heap space for the analysis. By adding -J-mx16G to the jhat command to give it 16 GB of heap, it was finally able to analyze the file and generate its output. For a 4 GB heap it took nearly 30 minutes to generate the results.
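For anyone following along, here is a rough sketch of those two steps: picking the PID out of netstat output (the kind that includes the PID column) and assembling the jmap call. Everything specific in it is an assumption for illustration: the port 8088 (the default gateway HTTP port, yours may differ), the sample netstat text, the PID, and the dump path. It only builds the command; run the result from an administrator console.

```python
def pid_listening_on(netstat_output, port):
    """Return the PID of the process listening on the given TCP port, or None."""
    suffix = ":%d" % port
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto, Local Address, Foreign Address, State, PID
        if len(fields) == 5 and fields[0] == "TCP" and fields[3] == "LISTENING":
            if fields[1].endswith(suffix):
                return int(fields[4])
    return None


def build_jmap_command(pid, dump_path, live_only=False):
    """Assemble a jmap heap-dump command line as an argument list."""
    options = ["format=b", "file=%s" % dump_path]
    if live_only:
        options.insert(0, "live")  # dump only objects that are still reachable
    return ["jmap", "-dump:" + ",".join(options), str(pid)]


# Made-up netstat text; paste your own output here instead.
SAMPLE = """
  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       916
  TCP    0.0.0.0:8088           0.0.0.0:0              LISTENING       4321
  TCP    10.0.0.5:8088          10.0.0.9:51515         ESTABLISHED     4321
"""

pid = pid_listening_on(SAMPLE, 8088)
command = build_jmap_command(pid, r"C:\dumps\heapdump.bin")  # hypothetical path
print(" ".join(command))
```

Make sure the drive has at least as much free space as the configured heap before running the dump.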

Thanks for pointing me in the right direction.

Still working through support to locate the heap consumer.

Kevin,

What you describe is exactly what we’re dealing with over here. Windows server, gateway slowly takes more and more memory - though sometimes we’ve seen rapid changes to a much higher use regime, that stays essentially static.

Were you able to locate the problem, and were you able to fix it, and how, in both cases?

Unfortunately no, we have not gotten to a conclusion at this time. I’ve had a case open with Ignition support since mid-June on this issue. At first we were directed to change our garbage collector to G1GC and to increase the heap from 2 GB to 4 GB. It did change the shape of the heap memory trend a bit but did nothing for the original problem. Increasing heap space did nothing but extend the period between gateway restarts.

I’ve gotten little else in response. I will occasionally get a response if I ping support asking for status as this has been ongoing now for 2 1/2 months.

I’ve sent multiple heap dumps from jmap.

Last contact I tried to see if there was a way to escalate the case due to its longevity. I was told there is no escalation process in place. I was also told that I would get better response if I called rather than exchanging email. I don’t understand that but keep it in mind if you open a case in the future.

I’m currently experimenting with the notion that if I get away from the Run Always SFC, the garbage collector might be able to do a better job of cleanup. The jury is still out on that experiment as I’ve only had the new solution in operation for 4 days in our office and it takes longer than that to tell if there is an improvement.

I certainly would like to hear that somebody is actively investigating this issue.

My particular application is socket based. It is using the socket functionality of python/java. I’ve read a number of articles that seem to indicate that special handling is required for the byte buffers used by the java socket implementation. I was wondering if that might be where my problem lies. The heap dump shows a huge allocation for java socket related classes.

That might explain support’s inability to help. If you are rolling your own sockets, you are responsible for all of the related object lifetimes, many of which require explicit close operations to clean up. (Every single socket opened should be managed by a single thread that guarantees closure with a try-finally construct.)
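A minimal sketch of the try-finally pattern, assuming a simple request/response exchange over plain TCP. The function name, timeout, and buffer size are placeholders, not anything taken from the application being discussed.

```python
import socket


def send_request(host, port, payload, timeout=5.0):
    """Open a socket, exchange one message, and always close the socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.settimeout(timeout)  # never block forever on a dead peer
        sock.connect((host, port))
        sock.sendall(payload)
        return sock.recv(4096)
    finally:
        # Runs on success, timeout, or any other exception, so the
        # underlying handle and its buffers are released every time.
        sock.close()
```

The point is that the close lives in exactly one place and cannot be skipped by an exception thrown between connect and the end of the exchange.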

I have taken every action imaginable for destruction/disposing of these socket related items. The connection is being shut down, closed, and deleted on every use.

Does this mean you are opening these connections often? Does the protocol require new connections often? Is there any reason you aren’t using a long-lived socket with an assigned thread? Are you using Netty and not explicitly releasing its ByteBufs?