Gateways failed to restart after increasing Max/Min Heap?

For the past few days we've been experiencing slow performance on the Ignition front end, and the Designer is just as slow.

I've been looking at network and machine performance (hardware/VM) but can't find anything that stands out.

Current readings on the gateways and their respective redundancies:

VM CRGW01:
- CPU: 10 GHz
- Memory: 38 GB of 96 GB
- Storage: 172 GB in use
- No disk queuing
- 6 vCPUs, CPU ready 9-39 ms (the time a vCPU has to wait before being scheduled on the requested resource)

VM CRGW02:
- CPU: 191 MHz
- Memory: 1.92 GB
- Storage: 96 GB
- No disk queuing
- 6 vCPUs, CPU ready 9-23 ms (same metric, in milliseconds)

VM CRGW01 OS:
- Disk: 17% of 58 GB
- Memory: 84% of 96 GB (should be OK for a Linux machine)
- CPU: 11%
- Load average: 8 / 8 / 8

VM CRGW02 OS:
- Disk: 15% of 58 GB
- Memory: 54% of 96 GB
- CPU: 11%
- Load average: 0.23 / 0.24 / 0.15

We’ve already upgraded to 96GB per VM for the OS.

The gateway memory usage in Ignition itself is about 30 GB (at the first reboot, after running for two weeks, it was around 47 GB).

When trying to configure a larger Java memory pool, we ran into some weird issues where the service would not start:

- max.heap=73728, init.heap=18432 -> service won't start
- max.heap=65554, init.heap=12288 -> service won't start
- max.heap=57344, init.heap=12288 -> service won't start
- max.heap=49152, min.heap=12288 -> no issues starting, but it gets slow pretty fast
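One plausible explanation for the failed starts is simply that the requested heap, plus OS headroom, exceeds what the guest can actually back with RAM. A rough diagnostic sketch (assumption: a Linux guest, heap values in MB as in the settings above, and a hypothetical headroom figure) that compares a candidate max heap against the kernel's availability estimate before you edit the config:

```python
# Compare a requested max heap against the kernel's MemAvailable estimate.
# The 4 GB headroom default is an illustrative assumption, not a rule.

def mem_available_mb(meminfo_path="/proc/meminfo"):
    """Return the kernel's MemAvailable estimate, in MB."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) // 1024  # /proc value is in kB
    raise RuntimeError("MemAvailable not found in " + meminfo_path)

def heap_fits(max_heap_mb, headroom_mb=4096):
    """True if the requested heap plus OS headroom fits in available RAM."""
    return max_heap_mb + headroom_mb <= mem_available_mb()
```

On a 96 GB VM already at 84% memory use, a 72 GB heap (max.heap=73728) cannot be fully backed by real RAM, which would be consistent with the service refusing to start or the OS swapping heavily once it does.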

I've been trying some different OS settings through the sysctl.conf file, but to no avail.
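For reference, the kind of sysctl.conf entry usually tried in this situation looks like the following (an illustrative value, not one from the original post; apply with `sysctl -p` or a reboot):

```
# /etc/sysctl.conf -- illustrative tuning for a JVM-heavy host
vm.swappiness = 1    # strongly prefer dropping page cache over swapping the heap
```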

If someone has any ideas, or has experienced a similar situation, it would be lovely to hear from you.

Regards,

My first thought is that, if the Designer is also slow, it points more to a design issue than a hardware issue.

My second thought is that init.heap and max.heap should be the same. May as well allocate it all right away.
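As a sketch, using the property names and the working figure from the post (both values in MB, set to match):

```
init.heap=49152
max.heap=49152
```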


Sounds like a memory leak more than a resource limit. Also, if you're actually using that much RAM efficiently (you probably are not), I would expect you to need a proportionate number of cores.

As soon as you run out of memory, your overhead turns into context switching through the backlog of processes you're trying to create.

I would focus on anything in the gateway that runs asynchronous processes; one or more is probably firing a new thread before the previous one has finished.
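A minimal sketch of that guard in plain Python threading (the same pattern applies around `system.util.invokeAsynchronous` in Ignition scripting; all names here are illustrative): skip the new invocation if the previous one hasn't finished, rather than stacking threads.

```python
import threading

_running = threading.Event()  # set while a background run is in flight

def guarded_async(work):
    """Run `work` on a background thread, but only if the previous run
    has finished; returns the Thread, or None if the run was skipped."""
    if _running.is_set():
        return None  # previous run still active: skip instead of piling up
    _running.set()

    def wrapper():
        try:
            work()
        finally:
            _running.clear()  # allow the next invocation once we're done

    t = threading.Thread(target=wrapper)
    t.start()
    return t
```

If runs must not be dropped, a bounded queue in front of a single worker thread is the usual alternative; either way, the point is that thread creation stops outpacing thread completion.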


Are those servers running other functions? Like your database? If so, don't do that. Also, are the hypervisors equipped with enough physical RAM for all of the VMs they hold? (In other words, are your hypervisors over-committed?)

It sounds like either the VMs themselves or the hypervisor is "thrashing", where disk storage is used to extend the available RAM, and everything slows down as swapping starts.
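A quick way to check for that from inside a guest (assumption: a standard Linux procfs; what counts as "too much" depends on your workload):

```shell
# Swap totals straight from procfs; "swap used" growing while the
# gateway runs is the classic sign of thrashing.
grep -E 'SwapTotal|SwapFree|MemAvailable' /proc/meminfo
awk '/SwapTotal/ {t=$2} /SwapFree/ {f=$2} END {print "swap used (kB):", t-f}' /proc/meminfo
```

On the hypervisor side, the per-VM ballooning and swap counters tell the same story if the host itself is over-committed.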
