High CPU Usage - Diagnosing

nminchin · May 8, 2018, 7:30am

Hi,

I’m looking to reduce the relatively high average CPU usage of an Ignition project (~58% overall, with Ignition using ~45-50%), but I’m not sure of the best place to start looking.
Are there any guidelines to designing an efficient project?
Is there a particular log file that I can look at? I have found the Threads diagnostics log contains the threads and their CPU usage, however none of the names mean much to me. The majority of the high CPU usage items, adding ~20% load, are named ‘webserver-xxxxxx’. What are these threads used for and can I reduce their usage at all?

Edit: an example of a webserver TIMED_WAITING detail is:

Thread [webserver-3214848] id=3214848, (TIMED_WAITING for java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6774de57)
sun.misc.Unsafe.park(Native Method) 
java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source) 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source) 
org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:392) 
org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:546) 
org.eclipse.jetty.util.thread.QueuedThreadPool.access$800(QueuedThreadPool.java:47) 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:609) 
java.lang.Thread.run(Unknown Source)

and a webserver RUNNABLE detail:

Thread [webserver-3214852] id=3214852, (RUNNABLE) (native)
sun.nio.ch.WindowsSelectorImpl$SubSelector.poll0(Native Method) 
sun.nio.ch.WindowsSelectorImpl$SubSelector.poll(Unknown Source) 
sun.nio.ch.WindowsSelectorImpl$SubSelector.access$400(Unknown Source) 
sun.nio.ch.WindowsSelectorImpl.doSelect(Unknown Source) 
sun.nio.ch.SelectorImpl.lockAndDoSelect(Unknown Source) 
sun.nio.ch.SelectorImpl.select(Unknown Source) 
sun.nio.ch.SelectorImpl.select(Unknown Source) 
org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:233) 
org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:181) 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:171) 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156) 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654) 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572) 
java.lang.Thread.run(Unknown Source)

Thanks in advance.

Nick

Sanderd17 · May 15, 2018, 7:03am

The webserver is just the gateway webpage. Sadly, the performance reporting is quite heavy on the web server, so you’ll always see it there when you have that page open. But if you close the page, it shouldn’t use any noticeable amount of CPU anymore.

Instead, most of the CPU is wasted in the “unclassified” part, which suggests it’s not a CPU problem but rather the system waiting on something else, such as disk reads for swapped RAM.

robert1 · May 22, 2018, 10:59pm

Hello-

I’m going to double up on this thread. I’m having the same issue, but inconsistently. In my case though the high CPU usage ends up crashing the gateway because it takes up 98+% of gateway CPU. I have to manually terminate every open client and then reset the gateway to get the issue to go away, then CPU usage goes back to 10-20%.

I’m having a heck of a time figuring out what is causing the problem. Any advice for how to track this down?

This has happened maybe 3 times in the last 3-4 weeks. When I look at the thread diagnostics, the top 5 threads (using 12-17% CPU each) are called webserver-xxxxxxx, as nminchin mentioned.

I’m the only person on our network that opens the gateway webpage, so does it make sense that there would be 5-10 separate threads for webserver running?

Kevin.Herron · May 22, 2018, 11:30pm

Can you take some thread dumps when this happens? The traces posted by nminchin don’t have anything helpful in them. The CPU usage figure and the stack trace you’re seeing are not in lockstep, unfortunately, so it might take some lucky timing to capture actual activity on the threads.
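Since a single dump may catch a thread between bursts of work, one option is to compare several dumps and count how often each thread shows up RUNNABLE outside of an idle wait. The following is a minimal sketch, assuming the dump text uses the same header and frame layout as the excerpts in the first post. The function names and the list of idle markers are illustrative, not part of Ignition. Note that a thread blocked in a selector `select()` call is reported as RUNNABLE even though it is only waiting for I/O, which is why such traces say little about CPU use.

```python
import re
from collections import Counter

# Header layout assumed from the excerpts in the first post, e.g.
#   Thread [webserver-3214852] id=3214852, (RUNNABLE) (native)
HEADER = re.compile(r"^Thread \[(?P<name>[^\]]+)\] id=\d+, \((?P<state>[A-Z_]+)")

# Frames that suggest a pooled thread is parked or polling for work.
# Taken from the two traces in the first post; extend as needed.
IDLE_MARKERS = (
    "QueuedThreadPool.idleJobPoll",
    "SelectorImpl.select",
)


def parse_dump(text):
    """Return a list of (name, state, frames) tuples from one dump."""
    threads = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            current = (match.group("name"), match.group("state"), [])
            threads.append(current)
        elif not line:
            current = None  # a blank line ends the current stack
        elif current is not None:
            current[2].append(line)
    return threads


def is_idle(frames):
    """True if any frame matches one of the known idle markers."""
    return any(marker in frame for frame in frames for marker in IDLE_MARKERS)


def busy_counts(dumps):
    """Count, per thread name, the dumps in which it was RUNNABLE and not idle."""
    counts = Counter()
    for text in dumps:
        for name, state, frames in parse_dump(text):
            if state == "RUNNABLE" and frames and not is_idle(frames):
                counts[name] += 1
    return counts
```

Passing the text of each saved dump file to `busy_counts` gives a rough ranking. Threads that appear busy in most dumps are the ones worth reading in full.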

jordan5316 · July 10, 2018, 11:18pm

I am having the same issue with high CPU usage. Not sure how to upload the ThreadDump file. Please advise.

nminchin · July 11, 2018, 12:13am

Hi Kevin,

I’ve finally got some time to grab some thread dumps. The CPU is now hanging around the 65% mark, and I would really like to find out why! I do have a fair few tag change events. Is it possible to tell from the dumps whether these are the culprits?

thread_dump (1).txt (145.4 KB)
thread_dump.txt (143.9 KB)
thread_dump (2).txt (166.5 KB)

Julio.Bautista · July 26, 2018, 3:26pm

I’m having the same issue. We added more cores and memory to our server and are now seeing that high memory usage is causing our gateway to crash. We also have tag event scripts, but the odd thing is that we have a development gateway with the same code that is not experiencing the crashes we see on our production Ignition gateway.

pturmel · July 26, 2018, 5:18pm

If you haven’t adjusted your ignition.conf to use G1GC instead of CMS, do that first. Seriously.
Also use the GC logging options until you have finished tuning.
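Once GC logging is on, the log can be summarized to see whether pauses are long or frequent enough to matter. This is a rough sketch, assuming Java 8 style log lines such as `[GC pause (G1 Evacuation Pause) (young), 0.0234560 secs]` and `[Full GC (Allocation Failure)  1024M->900M(2048M), 3.1234567 secs]`. Other JVM versions and logging options produce different layouts, so the pattern would need adjusting. The function name and the 1 second threshold are arbitrary choices for illustration.

```python
import re

# Matches the summary line of a pause and captures its duration.
# Layout assumed from Java 8 style GC logs; adjust for other JVMs.
PAUSE = re.compile(r"\[(?P<kind>Full GC|GC pause)\b.*?, (?P<secs>\d+\.\d+) secs\]")


def summarize_pauses(log_text, threshold_secs=1.0):
    """Return basic pause statistics, or None if no pause lines are found."""
    pauses = []
    for line in log_text.splitlines():
        match = PAUSE.search(line)
        if match:
            pauses.append((match.group("kind"), float(match.group("secs"))))
    if not pauses:
        return None
    durations = [secs for _, secs in pauses]
    return {
        "count": len(pauses),
        "total_secs": sum(durations),
        "max_secs": max(durations),
        "full_gcs": sum(1 for kind, _ in pauses if kind == "Full GC"),
        "over_threshold": sum(1 for secs in durations if secs > threshold_secs),
    }
```

Frequent full collections or pauses of several seconds would point at memory pressure rather than at any particular thread.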

Julio.Bautista · July 26, 2018, 6:01pm

We are using the G1 garbage collector.

Julio.Bautista · July 26, 2018, 6:19pm

I tried looking at the thread dumps to see if I can get more information on the gateway-shared-exec threads that are being blocked.

thread_dump (2).txt (101.2 KB)

pturmel · July 26, 2018, 6:26pm

I doubt the thread dumps will help. You have a memory leak. You should start with a class usage histogram captured with the jmap tool or equivalent. Compare that to a histogram captured before the memory consumption ramps up. Note that Ignition will stall for a few seconds while the histogram is captured in most cases.
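Comparing two histograms by eye is tedious, so here is a minimal sketch of the comparison. It assumes the text output of `jmap -histo`, where each data row looks like `   1:   12345   6789012  java.lang.String` (rank, instance count, bytes, class name). The function names are illustrative only.

```python
import re

# One data row of a class histogram: rank, instances, bytes, class name.
ROW = re.compile(r"^\s*\d+:\s+(?P<instances>\d+)\s+(?P<bytes>\d+)\s+(?P<cls>\S+)")


def parse_histogram(text):
    """Map class name -> (instances, bytes); header and total lines are skipped."""
    result = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            result[match.group("cls")] = (
                int(match.group("instances")),
                int(match.group("bytes")),
            )
    return result


def top_growth(before_text, after_text, limit=10):
    """Return (byte_growth, instance_growth, class) tuples, largest growth first."""
    before = parse_histogram(before_text)
    after = parse_histogram(after_text)
    growth = []
    for cls, (instances, size) in after.items():
        old_instances, old_size = before.get(cls, (0, 0))
        growth.append((size - old_size, instances - old_instances, cls))
    growth.sort(reverse=True)
    return growth[:limit]
```

Classes that keep growing between a baseline capture and a capture taken after memory has ramped up are the first candidates for the leak.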

PGriffith · July 26, 2018, 6:26pm

The named query cache maintenance system got revamped (for performance) in 7.9.9. The old cache maintenance is the root cause of the issue.
You probably didn’t encounter the issue on your dev system because without active users you wouldn’t be building up a significant cache. Since 7.9.9 currently has an RC out, the release should be out within a couple weeks.

Julio.Bautista · July 27, 2018, 1:41pm

So will setting my named queries to stop caching when I create them help mitigate the issue I am currently having?
Is it possible to run clearNamedQueryCache from the script console of my designer in order to clear the cache?

PGriffith · July 27, 2018, 2:57pm

I believe disabling caching should help, but I am not very familiar with the caching mechanism so I’m not entirely sure what the change involved - only that it was specifically designed to fix this kind of issue.

jordan5316 · July 30, 2018, 6:30pm

Unchecking the cache option on the named query has solved the high usage issue.

choy · December 12, 2018, 8:39am

Guys, I need help.

Server Specs:

HP ProLiant G9, 4-core CPU, no Hyper-Threading
Windows Server 2008 R2 Standard VM, 4 CPU cores, 12 GB RAM

Ignition Gateway:

ver. 7.9.6
45564 Tags
4 PLC connections
MySQL DB
6 Client Stations

Issues in the Field:

After the Gateway has been running for a long time, CPU usage goes up to 100%, which causes operational problems.
When we reboot the Gateway, it goes back to normal.
Is there a fix?

Using our office server (Dell PowerEdge, 24-core CPU and 128 GB RAM, with Hyper-Threading support):

We run the Gateway server VM with 8 CPU cores and 12 GB of RAM, and it works fine, but only for our simulation test.

Screenshots of Gateway status after reboot.

pturmel · December 12, 2018, 1:37pm

A snapshot after reboot isn’t much help. A snapshot while the problem is occurring would be helpful. But also post your ignition.conf, as requested in the other post.

ps. Please don’t post in multiple topics for the same problem. This should all be in the new topic.

oscar.salcedo · December 12, 2018, 2:10pm

I am glad this topic popped up again as we are seeing High CPU usage on our Gateway:

The CPU ranges anywhere from 15% to 99%. There are several PTW events logged:

We have a relatively large project: 7.9.9, MES-heavy (Sepasoft), 4 sites running on a single Gateway with over 180 clients, 160 queries/sec, and 143 devices. The Gateway runs on an ESX Cluster Node, alongside 8 other VMs that use very few resources (our VM is by far the top consumer).

I have attached a copy of our ignition.conf file and a thread dump taken during a CPU spike.

thread_dump.txt (619.2 KB)
ignition.conf.txt (8.4 KB)

pturmel · December 12, 2018, 2:29pm

You have both CMS and G1GC set in your ignition.conf. G1GC is last, so that was chosen, but I recall reading that CMS settings can interfere with G1GC. Get rid of the wrapper.java.additional.1, .2, and .3. Add a target pause time. Consider logging garbage collector performance to determine if that is the cause of your clock drift warnings. See these topics:
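For anyone wanting to check their own file for the same overlap, here is a minimal sketch that scans the text of an ignition.conf for `wrapper.java.additional.N` entries and reports which collectors are selected, whether CMS tuning flags remain, and whether a pause target is set. The function names and the returned dictionary are illustrative. The flag list covers only common HotSpot collector switches and is not exhaustive.

```python
import re

ADDITIONAL = re.compile(r"^\s*wrapper\.java\.additional\.(\d+)\s*=\s*(.*?)\s*$")

# Common HotSpot collector switches; not an exhaustive list.
COLLECTOR_FLAGS = {
    "-XX:+UseConcMarkSweepGC": "CMS",
    "-XX:+UseG1GC": "G1",
    "-XX:+UseParallelGC": "Parallel",
    "-XX:+UseSerialGC": "Serial",
}


def jvm_args(conf_text):
    """Return (index, value) pairs for active wrapper.java.additional.N lines."""
    args = []
    for line in conf_text.splitlines():
        match = ADDITIONAL.match(line)  # lines starting with '#' never match
        if match:
            args.append((int(match.group(1)), match.group(2)))
    return sorted(args)


def check_gc_settings(conf_text):
    """Summarize the GC-related JVM arguments found in the config text."""
    values = [value for _, value in jvm_args(conf_text)]
    return {
        # Collectors in index order; more than one entry means a conflict.
        "collectors": [COLLECTOR_FLAGS[v] for v in values if v in COLLECTOR_FLAGS],
        # CMS-specific tuning flags, e.g. -XX:+CMSClassUnloadingEnabled.
        "cms_tuning_flags": [v for v in values if v.startswith("-XX:") and "CMS" in v],
        "has_pause_target": any(v.startswith("-XX:MaxGCPauseMillis=") for v in values),
    }
```

A result with more than one collector, or with CMS tuning flags alongside G1, matches the situation described in this post.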

oscar.salcedo · December 12, 2018, 3:17pm

Thanks @pturmel!
Here’s what the GC section looks like now:

wrapper.java.additional.1=-XX:+UseG1GC
wrapper.java.additional.2=-XX:MaxGCPauseMillis=100
wrapper.java.additional.3=-Ddata.dir=data
wrapper.java.additional.4=-Dorg.apache.catalina.loader.WebappClassLoader.ENABLE_CLEAR_REFERENCES=false

We have a planned release to production tomorrow and we will restart the Gateway service for these to take effect. I will keep this thread updated.
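One thing worth double-checking after deleting entries is the numbering. By default the Java Service Wrapper reads `wrapper.java.additional.N` properties in sequence and stops at the first missing number, unless `wrapper.ignore_sequence_gaps` is enabled. Below is a small sketch of that check, run against the four lines above. The function name `first_gap` is made up for this example.

```python
import re

ADDITIONAL = re.compile(r"^\s*wrapper\.java\.additional\.(\d+)\s*=")


def first_gap(conf_text):
    """Return the first missing index in the wrapper.java.additional.N
    sequence, or None if the numbering is contiguous starting at 1."""
    indices = set()
    for line in conf_text.splitlines():
        match = ADDITIONAL.match(line)  # commented-out lines never match
        if match:
            indices.add(int(match.group(1)))
    if not indices:
        return None
    for expected in range(1, max(indices) + 1):
        if expected not in indices:
            return expected
    return None


# The GC section as posted above.
POSTED = """\
wrapper.java.additional.1=-XX:+UseG1GC
wrapper.java.additional.2=-XX:MaxGCPauseMillis=100
wrapper.java.additional.3=-Ddata.dir=data
wrapper.java.additional.4=-Dorg.apache.catalina.loader.WebappClassLoader.ENABLE_CLEAR_REFERENCES=false
"""

print(first_gap(POSTED))  # None: entries 1 through 4 are contiguous
```

A non-None result would mean that entries after the gap may be ignored by the wrapper.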

Oscar.