Poor Gateway Performance

Hi all,

We are running a project that has a lot of problems, the main one being a high delay between the Gateway and sessions.
Our memory trend looks normal when the application runs fine (regular sawtooth spikes). But when it gets slow, the memory trend flattens into a more stable, slowly climbing line.
I took a screenshot of the memory and CPU usage during this behaviour:

[memory and CPU screenshots]

So it looks like nothing is happening with the tags?

During the spikes, the application works fine. But once we see the memory trend become a more stable line, the application gets very slow. When we open a new window, all images are displayed within a second, but the data that goes with these images shows tag overlays (OPC waiting).
When we take a look with the OPC Quick Client, the data is updating every second, so that looks OK.

During the delays, we also get the following errors:
image
These disappear when the sessions become quick and responsive again.

Also, when we open the Designer, the data that should come in through the tag browser isn’t updating either. After a few minutes (up to 30 minutes!), the data gets updated really fast a few times and then stops again. The timestamps aren’t current either; they differ by a few minutes. So it looks like the data is buffering for some reason.

The weird thing is that this application can run fine for a few weeks, and then suddenly develop these delays. The delays persist for a few days or weeks, and then disappear again.

Can anyone help us with this problem? Any advice is very welcome!

Details: We are running on a Linux server, 250K tags with OPC UA; a session can hold up to approx. 5,000 subscriptions, with up to 10 sessions. Created with Ignition 8.1.2. G1GC garbage collector. All templates are created with indirect tags (tag paths).


You’re probably best off talking to support.
They’ll want to look at thread dumps to see what threads and processes are running at the time of the issues, as well as compare them to what they look like before the issues. Have you checked your wrapper log for any issues?
Do you have lots of scripts that run periodically, e.g. in tag change events, timer scripts, etc.?
Do you have lots of expression tags that aren’t set to event driven? (This is the default in v8, but it didn’t exist in v7 and isn’t changed when you upgrade to v8.)
Are you using the perspective alarm status table? This has known performance issues currently.
Are you calling functions like system.alarm.queryStatus or tag browse functions?

Do you have periodically called functions that don’t have adequate error handling?
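On the wrapper log question, a quick sketch (plain Python, not Ignition-specific) that groups repeated ERROR/WARN lines so recurring faults stand out. It assumes the usual Tanuki wrapper layout of `STATUS | jvm N | timestamp | message`; adjust the parsing if your log format differs:

```python
from collections import Counter

def summarize_wrapper_log(text, levels=("ERROR", "WARN")):
    """Group repeated ERROR/WARN lines from a wrapper.log and count them."""
    counts = Counter()
    for line in text.splitlines():
        # First pipe-delimited field is the log level.
        level = line.split("|", 1)[0].strip()
        if level in levels:
            # The message is the last field; timestamps differ per line,
            # so grouping on the message collapses repeats of the same error.
            msg = line.rsplit("|", 1)[-1].strip()
            counts[msg] += 1
    return counts.most_common()

sample = """INFO   | jvm 1 | 2021/05/01 10:00:00 | Ignition gateway started
ERROR  | jvm 1 | 2021/05/01 10:00:01 | ClockDriftDetector - drift detected
ERROR  | jvm 1 | 2021/05/01 10:05:01 | ClockDriftDetector - drift detected
WARN   | jvm 1 | 2021/05/01 10:06:00 | StoreAndForward - quarantine growing
"""
print(summarize_wrapper_log(sample))
```

Running it against a real log quickly shows whether one error dominates during the slow periods.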


Hi nminchin,

I took this to support but with no results so far. That’s why I was curious if someone could relate to this problem.
We do have a lot of scripts that run on tag change events. But these scripts are tested and used in bigger projects with a lot more tags (so a lot more tag event scripts running at the same time).
The project is made in Vision, so no Perspective alarm status table. (Good to know that this one has performance issues, because future projects will be in Perspective.)

The scripts that were triggered by tag change events have been disabled, without result. So we don’t think the scripts are the problem.

This has been going on for a month now. So we are kind of desperate.

BR,

Thomas

Have you used phone support, or only online through the portal/email? If you haven’t used phone support, that’s the way to go; have a rep remote in to take a look.
What about my other question re wrapper log errors?

What’s the CPU usage like when you get the issues? I would suggest turning on history for the CPU and memory utilisation system tags if they’re not already on.

To take thread dumps that include the CPU usage of each thread, you need to use the scripting function. Have a look here: Is anyone else having crippling issues after adding Perspective to projects? - #32 by nminchin for what I used. Note, the issues in my post weren’t caused only by Perspective, but by various other things as well; ultimately a culmination of years of building on and on without reworking things that worked for a smaller project but became too inefficient for a larger system (inexperience at the time was the main factor).
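As a sketch of what you can do with that output: `system.util.threadDump()` returns a JSON string, and you can rank threads by CPU from it. The field names below (`threads`, `name`, `cpuUsage`) are assumptions for illustration; check the actual JSON your gateway version produces and adjust:

```python
import json

def top_cpu_threads(dump_json, n=5):
    """Return the n busiest threads from a JSON thread dump, sorted by CPU.

    Schema assumption: {"threads": [{"name": ..., "cpuUsage": ...}, ...]}.
    Verify against the real output of system.util.threadDump().
    """
    threads = json.loads(dump_json).get("threads", [])
    ranked = sorted(threads, key=lambda t: t.get("cpuUsage", 0), reverse=True)
    return [(t["name"], t["cpuUsage"]) for t in ranked[:n]]

# Made-up sample dump standing in for the real gateway output.
sample = json.dumps({"threads": [
    {"name": "webserver-42", "cpuUsage": 1.2},
    {"name": "StoreAndForward-quarantine", "cpuUsage": 87.5},
    {"name": "opc-ua-subscription", "cpuUsage": 12.0},
]})
print(top_cpu_threads(sample, 2))
```

Taking one dump during the slow periods and one during normal operation, then comparing the top entries, is usually the fastest way to spot the culprit thread.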


The CPU consumption by the store and forward system is suspicious. Is your database on the same machine? Is there any possibility that your DB is bogging down?


Hi Pturmel,

The Ignition gateway is running in a VMware VM, and the database is running in a different VMware VM. So they are on the same physical machine, but separated.
The only difference from our other projects is that we use store-and-forward principles. Could the problem be there?

BR

What do you mean by principles?
What are the stats of your store and forward system on the status page in the gateway webserver? Can you screenshot it?

We had to set up store and forward for this project. The master gateway communicates with several local/edge gateways. So the data that comes from a PLC is transferred to both the local gateway and the master gateway (tag splitter). But we had to use store and forward as well, because the databases need to stay identical at all times (even when a connection is lost). A side effect of this is that we have a lot of quarantined items, because the same data is transferred twice into the same database. We receive the following warnings:
image

To answer on the other questions:
Our CPU usage is 20-25% all the time. It stays pretty stable, with or without the problems we are experiencing.

I’ll take a look at your post as well.

BR

I’ve also noticed this:
image

Seems like they are appearing together.

BR

Are the resources for each VM dedicated? By that I mean the sum of all memory allocated to VMs is less than the total physical memory of the hypervisor. And the sum of all CPU cores allocated to VMs is less than the number of physical cores of the hypervisor. If either statement is false, add resources to your hypervisor.
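That check is simple arithmetic; here is a small sketch with made-up numbers to illustrate it (substitute your real host and VM sizing):

```python
def check_overcommit(hypervisor, vms):
    """Flag overcommit: total VM allocations exceeding physical resources.

    All numbers below are illustrative, not from the original poster's setup.
    """
    mem_alloc = sum(vm["mem_gb"] for vm in vms)
    cpu_alloc = sum(vm["vcpus"] for vm in vms)
    return {
        "memory_overcommitted": mem_alloc > hypervisor["mem_gb"],
        "cpu_overcommitted": cpu_alloc > hypervisor["cores"],
    }

host = {"mem_gb": 64, "cores": 16}
vms = [
    {"name": "ignition-gw", "mem_gb": 32, "vcpus": 8},
    {"name": "database", "mem_gb": 40, "vcpus": 8},  # pushes memory past 64 GB
]
print(check_overcommit(host, vms))
```

If either flag comes back true, the gateway VM can stall whenever a neighbouring VM gets busy, which would match the intermittent nature of the slowdowns.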

That said, I really think you should have support look at your tag history configuration. I’m not an expert on the tag historian, but a proper config will not generate primary key conflicts or failures to create partitions.

image

You have a hundred million records (at least!) in quarantine - that’s what’s causing high CPU usage, because HSQLDB is not designed to handle that many records. @pturmel is correct - you need to fix your historian/storage settings.


Thank you for the feedback.
Could this really be the problem? I mean, it is not correct, that’s for sure. But why would the sessions be quick and responsive for a few days, and then suddenly become very slow with no tags updating?
I’ll definitely check it, but it doesn’t make a lot of sense to me.

BR

Trying to wrap my head around this part. Are you trying to use the historian to write to multiple databases and use replication at the same time?

Yes that’s correct.
It is a requirement from the customer. Could this be causing the problems?
They want their databases identical at all times, even if one location loses its connection with the master gateway.

BR

IMO, yes. It sounds like they want the benefits of a High Availability Cluster without an actual High Availability Cluster.

Also, in the threads I see this:
image

And when I open a session and everything works normally, I expect around 230 tag subscriptions. However, when we are having the delays, I only get e.g. 100 tag subscriptions for some time.
Could this all have to do with the architecture of this project?

So you would suggest that we start focusing on getting our data only to the master gateway, and that we don’t use the store-and-forward mechanism for now?

It’s not the store and forward that’s causing the issues, per se; it’s that you’re writing to two database tables, and if the replication happens to run faster than the store and forward, the record will be put into quarantine.
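To illustrate the race with a toy example (using SQLite in place of the real historian database; the table and column names are invented for the demo): if replication lands a row first, the store-and-forward retry of the same primary key hits an integrity error, which is when a record gets quarantined.

```python
import sqlite3

# A stand-in history table keyed the way historians typically are:
# one row per tag per timestamp.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history_demo ("
    "tagid INTEGER, t_stamp INTEGER, val REAL, "
    "PRIMARY KEY (tagid, t_stamp))"
)

# Path 1: replication delivers the record first.
conn.execute("INSERT INTO history_demo VALUES (1, 1000, 42.0)")

# Path 2: store-and-forward flushes the same record and collides.
quarantined = []
try:
    conn.execute("INSERT INTO history_demo VALUES (1, 1000, 42.0)")
except sqlite3.IntegrityError:
    # This is the point where the gateway would move the record to quarantine.
    quarantined.append((1, 1000, 42.0))

print(quarantined)
```

With two independent delivery paths racing to the same table, some fraction of records always loses the race, which is why the quarantine keeps growing.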

IMO, the application should only worry about writing to one node, and the cluster takes care of the master/replicas behind the scenes.

Great. I’ll try that.
So I would suggest that all the data from measurements and the like goes into one tag history splitter. This history splitter only writes to 2 databases. All other connections I will drop, and see what the result is. This way I am storing the data in 2 databases and that’s it. No replication whatsoever.

If I am talking nonsense, please tell me. I am not that experienced with databases…

BR

So I’ve tried disabling all gateway network communication and other database stuff that I don’t need. I no longer receive quarantined items and the like; unfortunately, my gateway performance is still not good.
In the best case, I can open a Vision client and everything pops up immediately. Even when I change windows and give commands with the components, everything is fast and responsive. But in the worst case, I try to open a client and nothing happens. When I look at the tag subscriptions per client on the gateway web page, I see that not all subscriptions are coming in. After that, it can take up to 30 minutes (worst case) before everything is populated. When I open another window, the same thing happens again. And then suddenly the application gets fast again.
Any tips?

BR