We have about 60 tags that use the system.alarm.queryStatus
function to summarise alarms in different folders; they're called every 5 s. We do still have all of the alarm status tables within Vision, but those haven't caused issues in the past. I'm considering cutting those 60 function calls down to 1, then filtering the result of that single call for the other areas.
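As a rough sketch of that consolidation idea (the folder names and state filter here are made up for illustration), one call can be fanned out client-side:

# one query instead of 60; filter the results per folder afterwards
results = system.alarm.queryStatus(state=['ActiveUnacked', 'ActiveAcked'])
counts = {}
for event in results:
    # source paths look like 'prov:default:/tag:Area1/Pump1/HiLevel:/alm:High'
    source = str(event.getSource())
    for area in ['Area1', 'Area2']:  # hypothetical folder names
        if '/tag:{}/'.format(area) in source:
            counts[area] = counts.get(area, 0) + 1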
We just had a Vision client become unresponsive; relevant client log attached (see the note at the top for the timeframe).
It became unresponsive when they clicked on a button. Button script screenshot below, along with the resulting error (the error took a while to show up; in the meantime the client was unresponsive).
It was unresponsive because you made a blocking write call on the GUI thread at a time when the connection to the gateway was apparently lost, so the client appeared frozen for 60 seconds, or whatever the default timeout is.
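As a sketch of the usual fix (not the original button script; the tag path and value are placeholders), moving the write off the event dispatch thread keeps the UI responsive even if the gateway is unreachable:

def doWrite():
    # writeBlocking can stall here waiting on the gateway without freezing the UI
    system.tag.writeBlocking(['[default]Some/Tag'], [1])  # hypothetical path/value

system.util.invokeAsynchronous(doWrite)

system.tag.writeAsync achieves much the same thing without the wrapper function.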
How would I check why the connection to the gateway was lost? These are clients on the same network, so they should never have dropouts unless it's an application thing.
Not sure… it could be a genuine network problem, or it could be that the gateway wasn't responding; but if it was the gateway not responding, other clients would likely have lost their connection as well.
That log you uploaded has multiple instances where the client it came from lost its connection to the gateway. It looks like 10 times within 45 minutes it reports no response and that the connection was lost.
This should have the CPU usage in it
nminchin, do you have redundancy enabled?
We experienced high CPU usage while the backup was connected to the same OPC server.
There was a fix issued in 8.1.3. If you are not there yet, I would recommend upgrading.
I would like to note that we are not using Perspective yet. Only Vision clients.
We cannot give specific help, since every deployment has different requirements and characteristics, but I hope you are not doing performance testing / capacity planning in your (customer's) production environment. Our customers do that in dedicated labs to avoid affecting their mission-critical deployments.
Here's a 5-minute video of a smallish 200,000-tag scenario:
Be careful when setting up an Ignition server as an EAM controller. We saw excessive CPU usage when Config/Network/Gateway Network Settings/General Settings/Allow Proxying was set to true. This set up many-to-many relationships across all gateways on the network. In my case, the issue was resolved when Allow Proxying was enabled only on the agent servers and not on the EAM controller. We are using a dedicated EAM server, and the CPU dropped from over 70% to just a few percent.
Yes, we moved to direct connections for any gateways that need to talk to each other a while back. Otherwise the proxying gateway gets weighed down far more than you'd expect compared to a direct connection.
I think we have the same issue as nminchin. However, we are starting a new project using 8.1.3 and only using Perspective. We have over a million tags. What we have noticed is that the thread count will be fine for the majority of the time and everything works really well for days or a week. Then suddenly, the thread count will slowly increase linearly until the gateway crashes. CPU stays at around 25-30% and memory is fine. We are talking with support but have yet to pinpoint anything. See below: the thread count was stable, then suddenly increased all the way up until it crashed.
Support are probably already looking at your thread dumps, but I wrote a couple of scripts to help diagnose these issues, or at least get a better perspective on them.
The first script has the function saveThreadDump(), which uses the system.util.threadDump() function to take a thread dump. This is far more useful than dumping from the gateway webpage, as it actually records CPU usage against each thread. I attached it to the CPU utilisation system tag's change event, triggering when utilisation goes above 0.50, with a 30 s cooldown; I send myself an email when it runs as well. This in particular has been extremely useful for support, as it's captured logs when the CPU has been high in different scenarios.
The second script will take a thread dump created by the first and convert it into a CSV that you can paste into Excel, where you can then sort by CPU usage and other things. Note: it also works with the webpage dump. The two dump formats are slightly different, but I've kept the headers the same, so there might be blank values for some headings depending on which dump you've converted.
I run the second script in PyCharm using Python 3.9; I haven't checked if it'll run in Ignition. You'll need to install the pywin32 library on Windows to get access to the clipboard; otherwise you could always just write the CSV contents to a file.
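If you'd rather avoid the pywin32 dependency, here's a minimal sketch of that write-to-file alternative (the output path is just a placeholder):

def writeCsv(text, path='thread_dump.csv'):  # hypothetical output path
    # the table text already uses \r\n line endings, so disable newline translation
    with open(path, 'w', newline='') as f:
        f.write(text)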
You’ll end up with something like this:
def saveThreadDump():
    '''
    Description:
        Takes a thread dump and records the current CPU, memory, and disk utilisation at the top of the file.
        This method is far more useful than using the webpage thread download button, as it contains CPU usage per thread.
        File is stored local to the client that issues the function call.
    '''
    now_formatted = system.date.format(system.date.now(), 'yyyyMMdd_HHmmss')
    filename = 'IgnitionThreadDump_AU6_{}.txt'.format(now_formatted)
    fileFolderPath = 'C:/Ignition Reports/Thread Dumps/'
    filePath = '{}{}'.format(fileFolderPath, filename)
    dump = system.util.threadDump()
    # collect some basic overall performance stats and record them at the top of the file
    stats = system.tag.readBlocking([
        '[System]Gateway/Performance/CPU Usage',
        '[System]Gateway/Performance/Disk Utilization',
        '[System]Gateway/Performance/Memory Utilization',
    ])
    cpu_util = stats[0].value * 100
    disk_util = stats[1].value * 100
    mem_util = stats[2].value * 100
    dump = 'CPU Utilisation: {:.2f}%\r\nMemory Utilisation: {:.2f}%\r\nDisk Utilisation: {:.2f}%\r\n{}'.format(cpu_util, mem_util, disk_util, dump)
    system.file.writeFile(filePath, dump)
    return filePath
Tag change script on [System]Gateway/Performance/CPU Usage:
cpu_log_condition = 0.5
if currentValue.value > cpu_log_condition:
    now = system.date.now()
    lastDump = system.tag.readBlocking(['[default]System/Diagnostic/Last Thread Dump Time'])[0].value
    if lastDump is None:
        lastDump = system.date.parse('1987-01-01 00:00:00')
    dumpLogRate = system.tag.readBlocking(['[default]System/Diagnostic/Thread Dump Log Rate'])[0].value
    if system.date.secondsBetween(lastDump, now) > dumpLogRate:
        # update the last-dump timestamp so the cooldown takes effect
        system.tag.writeBlocking(['[default]System/Diagnostic/Last Thread Dump Time'], [now])
        filepath = shared.dev.diag.saveThreadDump()
        shared.errors.sendEmail('CPU Usage High {:.1f}%'.format(currentValue.value * 100),
                                'Thread dump saved to disk with filepath: {}'.format(filepath))
Thread dump converter script, to convert a dump to CSV:
import sys
import re
import tkinter as tk
from tkinter import filedialog
from tkinter import messagebox
import win32clipboard  # part of library: pywin32

def setClipboard(text):
    # set clipboard data
    win32clipboard.OpenClipboard()
    win32clipboard.EmptyClipboard()
    win32clipboard.SetClipboardText(text)
    win32clipboard.CloseClipboard()

def getClipboard():
    # get clipboard data
    win32clipboard.OpenClipboard()
    data = win32clipboard.GetClipboardData()
    win32clipboard.CloseClipboard()
    return data

def choose_file():
    root = tk.Tk()
    root.withdraw()
    file_path = filedialog.askopenfilename()
    return file_path
def parseThreadDump():
    file_path = choose_file()
    f = open(file_path, "r")
    thread_dump = f.read()
    f.close()
    thread_dump = thread_dump.splitlines()
    thread_id = ''
    thread_type = ''
    thread_cat = ''
    thread_name = ''
    thread_cpu = ''
    thread_status = ''
    thread_jobs = []
    headers = ['ID', 'Type', 'Thread Category', 'Thread Name', 'CPU', 'Status', 'Job Count', 'Jobs']
    rows = []
    # find the ignition version row. Should be at the top, but user may have added additional stuff there such as cpu/mem/disk utilisation
    for line_num, line in enumerate(thread_dump):
        find = re.findall(r'(Ignition[a-zA-Z :]+)(\d+\.\d+\.\d+)( \(b[\d]+\))', line)
        if len(find) > 0:
            ignition_version_line_num = line_num
            ignition_version = '{} {}'.format(find[0][1], find[0][2])
            break
    # check what type of dump was created (from webpage or system.util.threadDump, as they are formatted slightly differently)
    # threads start on the 2nd line after the ignition version row
    if 'id=' in thread_dump[ignition_version_line_num + 2]:
        dump_type = 'webpage'
        thread_jobs_prefix = ' '  # indentation prefix for job lines; adjust to match your dump's actual whitespace
    else:
        dump_type = 'script'
        thread_jobs_prefix = ' '
    for line_num in range(ignition_version_line_num + 1, len(thread_dump)):
        line = thread_dump[line_num]
        # if the line contains thread jobs
        if line[0:len(thread_jobs_prefix)] == thread_jobs_prefix:
            thread_jobs.append(line.replace(thread_jobs_prefix, ''))
        # if the line contains CPU usage (only applicable for dump_type == 'script', webpage doesn't have this info)
        elif "CPU: " in line:
            thread_cpu = re.findall(r'[\d.]+', line)[0]
        # if the line contains thread status (only applicable for dump_type == 'script', webpage has this in thread line)
        elif "java.lang.Thread.State" in line:
            thread_status = re.findall(r'(: )([A-Z_]+)', line)[0][1]
        # ignore any blank lines
        elif line in ['', '"']:
            pass
        # else assume it contains a thread definition
        else:
            # if a previous thread is currently in memory, push it into the rows list
            if thread_status != '':
                thread_details = [thread_id,
                                  thread_type,
                                  thread_cat,
                                  thread_name,
                                  thread_cpu,
                                  thread_status,
                                  str(len(thread_jobs)),
                                  thread_jobs]
                rows.append(thread_details)
            # clear the variables
            thread_id = ''
            thread_type = ''
            thread_cat = ''
            thread_name = ''
            thread_cpu = ''
            thread_status = ''
            thread_jobs = []
            # record the current thread's info
            try:
                if dump_type == 'webpage':
                    thread_id = re.findall(r'id=(\d+),', line)[0]
                    thread_type = re.findall(r'^([a-zA-Z ]+) \[', line)[0]
                    thread_name = re.findall(r'\[([\w -./:\d@]+)\]', line)[0]
                    thread_status = re.findall(r'\(([\w_]+)\)', line)[0]
                    thread_cat = re.findall(r'([\w -./:\d@]+)-[\d]+', thread_name)
                    thread_cat = thread_cat[0] if len(thread_cat) > 0 else thread_name
                elif dump_type == 'script':
                    thread_name = line.replace('"', '')
                    thread_id = re.findall(r'-([\d]+)"', line)
                    if len(thread_id) == 0:
                        thread_id = ''
                    else:
                        thread_id = thread_id[0]
                    thread_cat = thread_name.replace("-{}".format(thread_id), '')
            except Exception as e:
                print(e)
                print('thread_name=' + thread_name)
                sys.exit()
    # push the final thread still in memory into the rows list
    if thread_status != '':
        thread_details = [thread_id,
                          thread_type,
                          thread_cat,
                          thread_name,
                          thread_cpu,
                          thread_status,
                          str(len(thread_jobs)),
                          thread_jobs]
        rows.append(thread_details)
    table = [headers]
    table.extend(rows)
    return table
def convertThreadArrayToExcelTable(rows):
    headers = rows[0]
    text = '"' + '","'.join(headers) + '"\r\n'
    data = rows[1:]
    for row in data:
        # replace the jobs list with a text version with new lines for each job
        row[-1] = '\r\n'.join(row[-1])
    for row in data:
        try:
            text += '"' + '","'.join(row) + '"\r\n'
        except Exception as e:
            print(e)
    return text

table = convertThreadArrayToExcelTable(parseThreadDump())
setClipboard(table)
messagebox.showinfo(title='Extracted Thread Dump', message='Extracted thread dump into a table. Copied to clipboard to paste into Excel.')
Thanks for that Nick,
That looks great. The only problem I can see with us using that code is that it triggers above 50% CPU usage. Ours sometimes goes over that CPU usage and is fine; other times it is not. It looks like if I copy that, it will be creating a lot of thread dumps for no reason. CPU usage doesn't seem to correlate strongly with the gateway crashing for us. It would be good if we could monitor the thread count, particularly the threads that are waiting; that seems like the trigger I need. If you look at the picture below, CPU goes above 50% often, but to the far right we had a crash. The thread count increase was only a brief window on that time scale.
All the spikes in the Java total thread count are when we believe the crashes to our gateway occurred; that's why I would like to use that as a trigger in Ignition, but I don't know how to get it. The last 14 days are shown. We are also thinking it has something to do with the alarm status table.
You still might be able to set it above some of the higher peaks, as it will only record one dump every x (30) seconds. It might give you something 🤷
Right now I can see the thread count slowly increasing again, yet CPU is very low. See attached.
BTW, I've been getting thread dump information by downloading Java from Oracle, so I don't need to take a thread dump from the gateway. Doing it from the gateway normally crashes the gateway:
https://www.oracle.com/java/technologies/javase-jdk11-downloads.html
I then install it.
Then I run jstack from CMD with admin rights.
Change directory to where the JDK was installed and go to the bin folder:
cd C:\Program Files\Java\jdk-11.0.10\bin
Then I type in the following, where the process ID comes from Task Manager (the PID of the Zulu Platform process), choosing a place to save it with the file name at the end:
jstack {process id} > D:\logs\Thread1.txt
Consider using java.lang.Thread.enumerate() in a gateway timer script to count the number of threads without dumping them, and logging that count. And/or conditionally calling out to jstack for the externally executed thread dump.
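A minimal sketch of that timer-script idea, assuming a memory tag exists to trend the count (the tag path is hypothetical, and Thread.activeCount() stands in for the enumerate() bookkeeping):

from java.lang import Thread

# activeCount() is an estimate of live JVM threads, but it's fine for trending
count = Thread.activeCount()
# hypothetical memory tag used to trend/alert on the thread count
system.tag.writeAsync(['[default]System/Diagnostic/Thread Count'], [count])

An alarm on that tag's rate of change could then trigger the thread dump script above instead of the CPU condition.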
@james_electriceng I think what you're observing is quite distinct from what @nminchin is working through. I think we'd want to see a thread dump as the thread count is climbing, not when CPU is high.
Here are some thread dumps:
Thread_2003161430.txt (992.0 KB) Thread_2003161431.txt (1006.8 KB) Thread_2003161434.txt (1.5 MB)
@james_electriceng
Just wanted to check how you went with diagnosing your crashes. Did you get this sorted, and what was the issue?