Clock drift and Garbage Collection

We have a small system, still under development, that has issues with clock drift.
I looked into the general guide, but that didn't help much.

After the gateway is restarted it works fine during the day, until for some reason it starts to consume memory. When GC gives the memory back, it causes clock drift. At least that's what it looks like to me.

I presume the next step is to find out what is causing the sudden increase in memory allocation.

What tool or procedure should I use to pinpoint what is suddenly consuming memory?

EDIT:
Some extra info:
It runs on a high-end laptop at the moment, a Windows 11 machine.
Antivirus exclusions have been set up following the recommendations from Inductive Automation.
Power settings are set to always on. A local admin account is used.

SQL Express is installed on the same computer.

The application is not used during the night; the computer is basically idle.
In the gateway, under Performance, I see no scripts running.

I tried downloading a diagnostic package and a gateway backup and fed both into GPT 5.5 for analysis. It didn't find anything particularly out of the ordinary, except that the clock drift seems to be related to GC.

I have now installed OpenJDK 17.0.17 and am running

.\jcmd.exe 10980 JFR.start name=heapcycle settings=profile duration=12h maxsize=4096m filename=C:\IgnitionLogs\heapcycle_12h.jfr

to collect information about the heap.

This is actually hard to do.

In our project, speaking from my experience, we had to "take apart" the system to find what was causing the memory leaks.
In our case it was a mix of poorly designed scripts and lots of redundant reads and writes to the database.

You should start by looking into your gateway. Check the Status > Diagnostics section. There you can check the logs, the current scripts and how many threads are running.

Also, Ignition is designed to accumulate a lot of memory and then clean it all up at once. It is normal to see the step-type graph you are seeing (but the clock drift is not normal).

Are you running Ignition on a VM server?

No, this runs on a high-end Windows 11 machine.

I'll update the info in the topic.

Not surprised. Windows 11 isn't a server operating system, so why would it be set up for optimal service resource usage? There are good reasons why desktop OSes are not supposed to be used for servers. Cr*ppilot probably has higher resource priority than a JVM...

Aside from that, I would happily wager that the same hardware with a Linux server OS would show none of these issues.

It will be moved to different hardware and Windows IoT when we are done with the system. I don't mind using Linux for the server, but in this case we will not.

Anyway, need to figure out what is causing the issue for this one. Whatever is causing the heap to grow shouldn't be impossible to figure out.

Again, this is not a server OS, why are you running a server on it?

Heaps grow until they hit roughly the 80% mark, then the JVM pauses and clears out memory held by objects that are no longer reachable from running code. That is normal behaviour.
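To illustrate why a long GC pause can show up as a clock drift warning, here is a minimal Python sketch of the general technique behind such a check: sleep for a fixed interval and compare it with the time that actually elapsed. This is not Ignition's actual detector; the function names, the interval and the 1 second threshold are all assumptions made for illustration.

```python
import time

def measure_drift(interval_s=1.0, samples=5, sleep=time.sleep, clock=time.monotonic):
    """Sleep `samples` times and return (actual - expected) seconds for each sleep.

    `sleep` and `clock` are injectable so the behaviour can be simulated.
    """
    deviations = []
    for _ in range(samples):
        start = clock()
        sleep(interval_s)
        elapsed = clock() - start
        # If the whole process was paused (e.g. by a stop-the-world GC),
        # the pause is added to `elapsed` and shows up as a deviation.
        deviations.append(elapsed - interval_s)
    return deviations

def drift_warnings(deviations, threshold_s=1.0):
    """Return only the deviations large enough to be reported (hypothetical threshold)."""
    return [d for d in deviations if abs(d) > threshold_s]
```

A process that is frozen for 2.5 seconds during one of the sleeps would report a deviation of about 2.5 seconds for that sample, even though the system clock itself never drifted.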

This is a small all-in-one HMI solution for a machine, so it will be running the gateway and the client on the same machine. Some basic logging and indicators and a few controls. Installing Windows Server 2025 for that seems like an unnecessary cost considering the application. But if the gateway can't run properly on Windows 11 we might have to reconsider.

The slowly rising sawtooth (garbage collection) is something I've never seen. For some reason you are slowly using more and more memory. The outright disappearance of the sawtooth indicates something is using memory and never letting it go. This looks like a memory leak, which may or may not be an Ignition issue.
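One rough way to tell a healthy sawtooth from a rising one is to look only at the low points after each collection and check whether they trend upward. The following Python sketch does that on a list of heap samples; the sample values are made up, and how the samples are exported (for example from a logged heap-usage tag) is an assumption.

```python
def post_gc_floors(samples):
    """Return the local minima of a heap-usage series (the low point after each GC)."""
    floors = []
    for i in range(1, len(samples) - 1):
        if samples[i] < samples[i - 1] and samples[i] <= samples[i + 1]:
            floors.append(samples[i])
    return floors

def floor_slope(floors):
    """Least-squares slope of the floors, in the samples' unit per GC cycle."""
    n = len(floors)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2.0
    mean_y = sum(floors) / float(n)
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(floors))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

# Made-up heap samples in MB: a flat sawtooth and one whose floor keeps rising.
healthy = [100, 300, 500, 100, 300, 500, 100, 300, 500, 100, 300]
leaking = [100, 300, 500, 150, 350, 550, 200, 400, 600, 250, 450]

print(floor_slope(post_gc_floors(healthy)))  # 0.0
print(floor_slope(post_gc_floors(leaking)))  # 50.0
```

A slope that stays clearly positive over many cycles suggests retained objects rather than ordinary garbage, which is the pattern described above.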

Is the database on a separate machine, or the same machine? Ignition and the database both really need dedicated CPU and RAM resources. If they are not separate, consider using VMs to separate them (you should consider this regardless, if Hyper-V is available with your Microsoft license).

You can also try to isolate the issue by disabling devices, database connections, modules, and tags, individually, one slow step at a time.

I have encountered this situation before, but I'm not sure if it's the same problem as yours. Please check whether your scripts create a large number of HttpClient instances and whether you use a large number of value change triggers. I used to create the HttpClient inside a def block, so a new one was created on every call. Later, I created a new script in the script library and moved the creation of the HttpClient outside the def block, and I also changed a large number of value change triggers to gateway tag change scripts.
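The difference between the two patterns can be sketched in plain Python. `FakeHttpClient` is a hypothetical stand-in that only counts how many instances get created; in an Ignition project library the real object would presumably come from `system.net.httpClient()`, which is not available outside Ignition, so the names and URL below are assumptions.

```python
class FakeHttpClient(object):
    """Hypothetical stand-in for an expensive-to-create HTTP client."""
    instances = 0  # counts how many clients have been constructed

    def __init__(self):
        FakeHttpClient.instances += 1

    def get(self, url):
        return "GET " + url

# Pattern to avoid: a new client is constructed on every call.
def fetch_per_call(url):
    client = FakeHttpClient()
    return client.get(url)

# Pattern described above: the client is created once, when the library
# module is loaded, and every call reuses that single instance.
_CLIENT = FakeHttpClient()

def fetch_shared(url):
    return _CLIENT.get(url)
```

If `fetch_per_call` is driven by a tag that changes every second, that is one new client per second; `fetch_shared` keeps the count at one no matter how often it runs.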

This is just a guess without looking at your system, but I think you probably have some kind of loop in your system from chaining bindings or scripts.

I use this tool to try to weed stuff like that out, if that is indeed the case:

This will build logging for your IO scripts and bindings. If you turn them on, you can review the logs and determine what is being most frequently called. If it's too much, you can grab a copy of the logs and ask an LLM to help you analyze.

Does that traffic cop script wrap auditing around the calls themselves, or does it inject a library of wrapper functions for the same functions that includes that audit logging, and replace the original system calls with the new wrapped functions?

The latter is what I would do, as it centralises the wrapping and reduces clutter everywhere it's used.

The former, encased in easy-escaped try/except blocks. I'll take clutter to know what's going on and not risk breaking an entire project. Also, it's traced and removable, so that's pretty cool!

I just checked the code for this in particular; looks like you need to be careful not to add anything else into these injected try/excepts, otherwise the removal code will overwrite them to the originals.

I would still much rather have a standard set of wrapper libraries defining e.g. shared.tag.writeBlocking with exactly the same args as the originals, and replacing all references to the same system functions. Then you have control of what they do from the set of libraries, including turning logging on or off for all.
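A minimal sketch of that wrapper-library idea, in plain Python so it can run outside Ignition: `_write_blocking` is a hypothetical stand-in for `system.tag.writeBlocking` (its default timeout and return value here are assumptions), and `audited`, `AUDIT` and the logger name are made up for illustration.

```python
import functools
import logging

logger = logging.getLogger("io_audit")  # hypothetical logger name
logger.setLevel(logging.INFO)

# Single switch that turns audit logging on or off for every wrapped call.
AUDIT = {"enabled": True}

def audited(func):
    """Wrap `func` so each call is logged, without changing its args or result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if AUDIT["enabled"]:
            logger.info("%s args=%r kwargs=%r", func.__name__, args, kwargs)
        return func(*args, **kwargs)
    return wrapper

def _write_blocking(tagPaths, values, timeout=45000):
    """Hypothetical stand-in for the real call; returns one result per tag path."""
    return ["Good"] * len(tagPaths)

# Same name and same arguments as the original, so callers only change the prefix.
writeBlocking = audited(_write_blocking)
```

Because the wrapper passes arguments and return values through untouched, a project-wide find/replace of the call prefix should not change behaviour, and setting `AUDIT["enabled"] = False` silences the logging everywhere at once.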

Great idea! You can fork this and update it to do that! Maybe keep this logic for bindings though? Not sure how you'd build a wrapper for that.

Not sure if this was facetious, but honestly I would just build this (the wrapper script libraries and calls to them) into your project and, if it's an existing project, do a bulk find/replace with VS Code :man_shrugging: There'll be some testing required obviously for an existing project, but if the args are identical in the wrapper functions and they return the same results, then there shouldn't be any issues. Famous last words! :grimacing:

Thank you, that is good advice. I think I need to do it that way to get an idea of why this is happening, unless I can read something out of the memory dump. I'll check that now.

The computer has 64 GB of RAM and an Intel i9. There is a cap on how much memory and CPU SQL Express can use, so I don't think that is the issue. The host machine sits at basically 2% CPU and 40% memory utilization.

And then suddenly for some reason it released the memory at 10:04 today.

Unfortunately I don't see why. There is nothing in the log and I was not logging heap memory either, but at least something happened in the right direction.

I'm no expert with JDK Mission Control, but with a bit of help from ChatGPT, the earlier log points towards the Siemens Enhanced driver. I'll disable and re-enable it later, once the memory has climbed back up, to see if anything happens.

For memory leaks I've found support to be the best help. They deal with it all the time.

I've never seen the exact behavior that you have, but I have seen erratic memory behavior when a Device is offline... then the behavior goes back to normal when the Device comes back online.

It could be something totally innocent-looking, like a script that creates an instance of an object (like the HTTP client issue that @senthee_wu describes), then doesn't handle a tag read very well, then faults out, and the instance does not get cleaned up.

What version of Ignition and what version of the driver are you running? Any custom settings in the driver configuration?