In-memory Caching, Server Memory Grows Without Bound

cmaynes · September 25, 2020, 6:42pm

Version: 7.9.6

I am using a decorator function to cache query data that rarely changes. I have already checked that there is no leak in my decorator/caching system using the Python pympler package outside Ignition. I am not leaving references around after removing something from the cache. Also, I can run a client for days without any issues; I just see a normal buildup and release without freezing or slowdowns.

The same decorator is enabled on the gateway, since I have a WebDev module with a service that needs the same data repeatedly. I am not looking for alternative ways of caching. My problem is that the server's non-heap memory grows without bound and eventually I have to restart the server. Even restarting the server proves difficult, as it either never shuts down using the GCU or takes a very long time.

I am not too familiar with Java. Does anyone have any suggestions on how to troubleshoot this?

Update:

Found it is something to do with saving a project. During a development cycle where I was frequently saving to see the changes reflected in an open client, I saw a clear uptick in Non-heap memory usage.

Kevin.Herron · September 25, 2020, 8:10pm

So without the decorator in use on the gateway you don’t see the same growth? That screenshot hardly looks like runaway growth… what are the numbers it represents?

cmaynes · September 25, 2020, 9:57pm

Yes, and the decorator is absolutely required or else I will flood the database with calls for things like Users, Constants, etc.
3 GB of Non-Heap and 6 GB of Heap.
The entire VM instance crashed at one point and had to be restarted because the memory reached 28 GB, the max allocated to the VM.

Kevin.Herron · September 25, 2020, 10:10pm

Hmm. Well, I have no idea why your decorator implementation would cause non-heap growth. Maybe something to do with Jython compiling functions to byte code and leaking… but not sure why it wouldn’t happen when you don’t use the decorators.

cmaynes · September 26, 2020, 1:08am

I am trying to catch the system in the act, and it appears to happen when the gateway restarts following saving/publishing a project. More memory seems to be consumed when changes to gateway scripts are made, which is also where the bulk of the codebase is. Is it possible the old version is not being discarded, so that the cache stored from the previous version sticks around in memory?

Kevin.Herron · September 26, 2020, 1:16am

Maybe, but if the old cache is sticking around it would be heap memory.

You may need to hook up a tool like JProfiler or VisualVM and see if you can get anywhere. One thing worth looking at is if the G1 Metaspace memory usage just keeps increasing.

cmaynes · September 26, 2020, 5:28pm

I will try this, thanks

cmaynes · October 1, 2020, 7:14pm

I tried JProfiler and I did end up finding/realizing a problem & finding a solution.

Running JProfiler on the server where Ignition is running and hammering the save button in a designer, I found a large number of classes were being created and never released. Looking in Heap Walker > Biggest Objects, I could see all the memory was being allocated to org.python.core.PySystemState$PySystemStateCloser. From there the only thing I could guess was that the caching function was somehow conflicting with the proper removal of cached data. Turning off caching and simply returning the function results every time, restarting the Ignition server, and then repeating the same process as before proved this theory out.

My implementation of the LRU cache is nearly identical to the one in the standard library in Python 3, functools.lru_cache. In that design and pretty much every other design involving a doubly-linked list in Python you can find, the nodes do reference one another and do create a circular reference. I believe it is not a problem in Python 3, CPython, because with the LRU cache you will dump the old unused cached values and the cache will never grow to anything significant.
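The circular structure described above can be seen in a few lines. This is only a sketch modeled on the list-based nodes of the pure-Python functools.lru_cache, not the poster's actual code; the helper names new_ring and append_node and the sample data are made up for illustration.

```python
# Field positions inside each node list, following the layout used by the
# pure-Python functools.lru_cache in CPython.
PREV, NEXT, KEY, RESULT = 0, 1, 2, 3


def new_ring():
    """Create an empty ring: a root node whose PREV and NEXT point at itself."""
    root = []
    root[:] = [root, root, None, None]
    return root


def append_node(root, key, result):
    """Link a new node in just before the root (the most recently used slot)."""
    last = root[PREV]
    node = [last, root, key, result]
    last[NEXT] = node
    root[PREV] = node
    return node


root = new_ring()
node = append_node(root, "users", ["alice", "bob"])  # illustrative data only

# Each object is reachable from the other: root -> node -> root.
print(root[NEXT] is node, node[NEXT] is root)
```

Because every node holds hard references to its neighbours, anything that keeps one node alive keeps the whole ring, and every cached result in it, alive as well.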

Normally in CPython you would just kill the interpreter process and that would be that. On the Ignition server, there were many instances of PySystemStateCloser, each consuming ~6 MB. After the fix to the caching function, I am down to just 1. It appears that whatever Ignition/Jython does when restarting the gateway interpreter following a project save cannot handle this circular reference problem, so the memory remains allocated until Ignition is restarted.

To actually fix the caching function, I did the following:

1. Store the hash of the function parameters as the key instead of the parameters themselves. This fixed a problem when the decorator was used on classes, where self is passed as a param and the class references the decorator so you end up in a circular reference.
2. Use weakref.ref to create “links” between the nodes, removing the circular ref between the nodes themselves.

Update:

Weakrefs were still a problem and still stuck around forever; it was just less of a problem, so it appeared to have worked. Changing to using the hash values in each node fixed that problem. It just requires a dict lookup to get adjacent nodes.
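A rough sketch of that final design, as I read the description: each node stores the keys (hashes) of its neighbours rather than references to them, and a single dict holds the only hard references to the nodes. The names KeyLinkedLRU, cached, and the sentinel strings are hypothetical, and this is not the poster's actual implementation. Note that keying on the hash alone means two different argument sets that happen to hash equally would share a cache entry.

```python
import functools

_HEAD = "__head__"  # sentinel keys; argument hashes are ints, so no clash
_TAIL = "__tail__"


class KeyLinkedLRU(object):
    """LRU cache whose nodes link to their neighbours by key, not by reference."""

    def __init__(self, maxsize=128):
        if maxsize < 1:
            raise ValueError("maxsize must be at least 1")
        self.maxsize = maxsize
        self.clear()

    def clear(self):
        # key -> [prev_key, next_key, value]; this dict owns every node.
        self.nodes = {_HEAD: [None, _TAIL, None], _TAIL: [_HEAD, None, None]}

    def _unlink(self, key):
        prev_key, next_key, _ = self.nodes[key]
        self.nodes[prev_key][1] = next_key
        self.nodes[next_key][0] = prev_key

    def _push_front(self, key):
        first = self.nodes[_HEAD][1]
        node = self.nodes[key]
        node[0], node[1] = _HEAD, first
        self.nodes[_HEAD][1] = key
        self.nodes[first][0] = key

    def get(self, key, default=None):
        if key not in self.nodes:
            return default
        self._unlink(key)       # move the entry to the most recently used slot
        self._push_front(key)
        return self.nodes[key][2]

    def put(self, key, value):
        if key in self.nodes:
            self.nodes[key][2] = value
            self._unlink(key)
        else:
            if len(self.nodes) - 2 >= self.maxsize:
                oldest = self.nodes[_TAIL][0]  # least recently used key
                self._unlink(oldest)
                del self.nodes[oldest]
            self.nodes[key] = [None, None, value]
        self._push_front(key)


def cached(maxsize=128):
    """Decorator that caches results under the hash of the call arguments."""
    def decorator(func):
        cache = KeyLinkedLRU(maxsize)
        missing = object()

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = hash((args, tuple(sorted(kwargs.items()))))
            result = cache.get(key, missing)
            if result is missing:
                result = func(*args, **kwargs)
                cache.put(key, result)
            return result

        wrapper.cache = cache  # exposed so the cache can be cleared explicitly
        return wrapper
    return decorator
```

Finding a neighbour costs one extra dict lookup per hop, which is the trade-off mentioned above, and calling wrapper.cache.clear() drops every node at once.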

pturmel · October 1, 2020, 7:32pm

You may not want to use weakref. You will likely see cached objects dropping out and breaking your linked list. Consider using system.util.getGlobals() in v7.9 or v8.1 to manage persistence (my LifeCycle module in v8.0) with explicit discard of any old objects when your caching script module loads.

cmaynes · October 1, 2020, 7:45pm

My explanation may not have been clear, but the dictionary in the decorator stores the hash as the dict key and the value is a hard ref to the node. So long as the dict entry persists, so will the node.

As for using the global dict, that would be good in some cases but I prefer to wipe out the cache rather than try to persist. Also, the problem with trying to maintain the cache is that I would need to get a hash of a function, but usually an update is pushed because some part of the code changed and I would want the change to take effect and have the cache cleared out. From the rest of what you said about discarding old objects on script module loading, that can be accomplished with what I have now with a dict in the decorator.

pturmel · October 1, 2020, 8:50pm

I didn’t mean you should persist the cache contents, the opposite in fact. Just that getGlobals() is a place where you can stash a reference to your cache that can be read during script module initialization, to easily obtain access to the prior version for explicit clearing.
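A minimal sketch of that pattern, under the assumption that the dict returned by system.util.getGlobals() survives a script restart as described above. A plain dict stands in for it here so the snippet runs outside Ignition; replace_cache and CACHE_KEY are made-up names.

```python
CACHE_KEY = "query_cache"  # hypothetical name for the stashed reference


def replace_cache(shared, new_cache, key=CACHE_KEY):
    """Explicitly empty the cache left by a prior module load, then stash the new one."""
    old = shared.get(key)
    if old is not None:
        old.clear()  # discard the old contents instead of persisting them
    shared[key] = new_cache
    return new_cache


# Stand-in for system.util.getGlobals(); on the gateway the real call would
# be passed in instead (an assumption, not tested against Ignition here).
shared = {}
first = replace_cache(shared, {"users": ["alice"]})  # first module load
second = replace_cache(shared, {})                   # load after a project save
```

After the second call, the cache from the first load has been emptied and only the new one is stashed, so nothing from the old script version is kept on purpose.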

cmaynes · October 8, 2020, 11:00pm

I was wrong; I still have a problem. The problem is not nearly as bad as when I started, but no matter what I do, PySystemState instances keep hanging around. I could deal with restarting the Ignition service every so often, but the Ignition service refuses to close and instead hangs for a long period of time. I cannot find the same problem anywhere online for Jython, so maybe this is specific to Ignition? Or maybe this is because Jython is not that popular and those who are using it are on 2.7, which may not have this problem.

pturmel · October 9, 2020, 2:08am

You’re probably going to have to do some profiling. I’d start with jmap to get a heap snapshot. You might need to log the java temporary classnames for any python objects that extend java classes/interfaces.

cmaynes · October 9, 2020, 2:35pm

I have tried profiling, as suggested by @Kevin.Herron, but I cannot figure out the problem. It just looks like the codebase is being held onto. When I had the old cache design, I could clearly see a massive dictionary of the same design as the one I had for the cache, but that is not the case here.

Here are the largest objects from a heap dump following the saving in the designer.

Heap_walker_Biggest_objects_2.html (54.4 KB) Heap_Walker_Classes.html (2.0 MB)

pturmel · October 9, 2020, 3:14pm

You are going to have to study what is owning the python objects in those PySystemState instances. Somewhere your code is registering a java listener or doing something that is making Ignition’s core code hold a reference into that old code. It’s not the size of the heap items that matters so much. Pick one of those instances and build a map of all the ownership, discarding owners within that instance.