Slow gateway performance

Hi everyone,

I have a major problem here, and I’ve been trying to solve it for two weeks now.

I have a gateway (version 7.9.3) that runs a lot of projects and a lot of clients (about 160).
The problem: when doing a system.tag.write in a tag’s valueChange script, there is a big lag on the write. It may take up to 30 seconds before the write takes effect (i.e., before the result shows on the tag). This problem occurs with memory tags, so the PLCs are not the cause.

I added log messages in the script to check for a delay between the beginning and the end of execution. The script executes very fast, no problem there. But the effect of the system.tag.write is very slow.

Everything on the server looks fine. CPU is about 25%, RAM about 50%, and all threads execute fine. The server is a virtual machine with 4 cores and 24 GB of RAM. From the VMware perspective, everything is running fine (almost idle).

Sometimes the problem goes away by itself. It looks like at some point a queue grows and grows and tag writes become slow. Everything else seems OK: tag reads are OK, expression tags are OK too, a button in a client that writes to a tag with system.tag.write is OK, and a system.tag.write in a gateway timer script is OK too.

It’s not an easy one, and I really need help identifying the problem so I can make the right decision.
Any ideas?

Is this the default tag provider, or is it a database tag provider?

There are 2 realtime tag providers. No database tag provider.

How is your disk usage on the server? I suspect the slowness points to some hardware limitation, since it works fine with fewer clients.

Btw, system.tag.write will always appear fast on its own; it’s an async function and just sends a write request without waiting for the answer. You can also try system.tag.writeSynchronous to see whether an actual write takes that long.
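For example, something like this (a rough sketch; the tag path is a placeholder integer memory tag) shows the difference between the two calls:

import time

path = "[default]Test/WriteLatency"   # placeholder memory tag

start = time.time()
system.tag.write(path, 1)   # async: returns as soon as the request is queued
print "write() returned after %.3f s" % (time.time() - start)

start = time.time()
system.tag.writeSynchronous(path, 2)   # blocks until the write actually completes (or fails)
print "writeSynchronous() finished after %.3f s" % (time.time() - start)

If it’s the synchronous call that takes ~30 seconds, the delay is downstream of your script, in whatever is servicing the write queue.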

From the VMware perspective and from the gateway perspective, the disk usage seems OK.

You are right about the tag write. I tried a writeSynchronous and it was faster. I suppose the write call fills a FIFO that is managed by another process, and at some point, for some reason, that FIFO becomes too large or the process that manages the queue becomes slow.

I did some other tests yesterday. Here is what I did:

TEST #1: Goal: test whether the number of connected clients can cause latency.
Method: normally the gateway runs about 150 to 160 clients. I started 63 more clients, for a total of 215 clients running. Total client requests/sec = 677. Everything worked fine… no latency detected for about 2 hours.
Conclusion: no evidence that the number of clients causes the problem.
TEST #2: Goal: test whether many system.tag.write calls in a valueChange script can cause latency like the one observed when the problem occurs.
Method: I created a tag with a valueChange script that executes 10,000,000 system.tag.write calls. The gateway took about 30 seconds to execute all the writes. Meanwhile, small latencies were observed, but nothing more than 3 seconds. WAY LESS than the 30 seconds observed when the real problem occurs.
Conclusion: a small effect was observed, but way less than what is seen when the real problem occurs.
TEST #3: Goal: test whether many system.tag.write calls from a client can cause latency.
Method: I copied the script from test #2 into a button on a client. The script executes 10,000 system.tag.write calls. The gateway took about 30 seconds to execute all the writes. Meanwhile, NO LATENCY was observed.
Conclusion: everything works fine. No effect.

Another fact: when the problem occurs, it occurs on ALL realtime tag providers and on all scan classes.

This one is not easy… Any ideas?

Indeed. Your testing is thorough. You should probably contact support for help identifying which system loggers would be useful in narrowing this down. They will probably ask you to upgrade to 7.9.4 as part of the troubleshooting effort.

I already contacted support, and after many hours of searching for what’s going on, they found nothing. Anyway, I will contact them again today for sure.

Today I would like to run some tests toward moving to a decentralized architecture. I would like to separate tag collection and client management onto two distinct servers.

Is it complicated to do? Does anyone have tips?

Consider building a timer event using system.tag.writeSynchronous to a memory tag, where you grab a timestamp before and after the call to detect when the hang is occurring. Have the event fire every few seconds. When the synchronous write takes more than a second, you could trigger a thread dump to see what is going on at that time.
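Something like this as the gateway timer script would do it (a sketch; the tag path, logger name, and threshold are placeholders to adapt):

import time

logger = system.util.getLogger("TagWriteWatchdog")
path = "[default]Diag/Watchdog"   # placeholder DateTime memory tag

start = time.time()
system.tag.writeSynchronous(path, system.date.now())   # blocks until the write completes
elapsed = time.time() - start

if elapsed > 1.0:
	# this is the moment to capture a thread dump (gateway Threads status page, or jstack on the gateway JVM)
	logger.warn("Synchronous memory tag write took %.1f seconds" % elapsed)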

How much memory is allocated to the heap for the Ignition process? Perhaps this might be the result of garbage collector operations? With larger memory sizes, using the G1 garbage collector is supposed to provide better behavior and fewer “pause the world” events? Just brainstorming, but it sounds like that might be something you could tune. Whether it fixes “a” problem or “the” problem we’ll have to see…
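If you want to experiment with G1, the JVM arguments live in the gateway’s ignition.conf file (the values and .N indices below are illustrative; the indices just need to continue your existing wrapper.java.additional entries, and a gateway restart is required):

# data/ignition.conf (illustrative)
wrapper.java.initmemory=1024
wrapper.java.maxmemory=18432
wrapper.java.additional.1=-XX:+UseG1GC
wrapper.java.additional.2=-XX:MaxGCPauseMillis=100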

As you can see in the graph in the first message of this post, there is 18 GB of memory allocated and Ignition uses about 9 GB. So there is apparently enough memory.

I was also thinking the same about the garbage collector, so I overlaid the times of the issue on the memory usage graph. The problem does not appear only when the memory is dropping, so the garbage collector does not seem to be involved.


Did I read you correctly that a synchronous write is faster?

If so, perhaps the OS is having a hard time context-switching to a new thread. I don’t know when exactly Ignition starts or stops new threads. Nor do I know how context switching will show up in a virtual machine. But perhaps it’s worth investigating.

Do you also tend to use gateway scripts in a dedicated thread?

No, it just doesn't return to the caller until the write is complete (or failed), giving the caller the opportunity to measure the time it takes.

Gateway scope doesn't have an event dispatch thread, so everything runs in the equivalent of a background thread. Events are issued from thread pool executors, though, so any sleeping in an event can interfere with other events. If there's ever a need to sleep or wait within a thread (rare but not non-existent), a dedicated thread is appropriate.
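For example, something like this (a minimal sketch; the function body and tag path are placeholders) keeps a blocking wait off the shared event executors:

def slowJob():
	import time
	time.sleep(10)   # a blocking wait is tolerable here because this runs in its own thread
	system.tag.write("[default]Diag/SlowJobDone", 1)

system.util.invokeAsynchronous(slowJob)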

The problem has been under control for approximately a week now.

I found a strong correlation between our problem and Kepware events.

Our server communicates with the PLCs through a Kepware server. We had one Allen-Bradley channel containing 5 CompactLogix/ControlLogix PLCs. When one of the PLCs goes offline, it seems the whole channel responds very slowly to write requests.

So when that happens, a queue of “tags to write” fills up in Ignition. This queue seems to contain all writes done by system.tag.write, even to memory tags, so every system.tag.write becomes very slow.

We created one channel per PLC in Kepware and the problem is now under control.


Ahhh, thanks for following up with us Vince… Yeah, within KEPServerEX, device communication is serialized within a “channel”. This allows for communication to things like serial devices on a multi-drop network (where you only want to talk to one device at a time). It’s definitely not needed for Ethernet devices; you are correct to have moved to a one-device-per-channel configuration.

Hello,

I stumbled upon this interesting thread. I have similar symptoms to what Vince had. Regarding that, I have one question: where do I find the tag-to-write queue?

Thanks in advance for feedback, take care.

Hello all,
I have the same problem. For about a month now, the Ignition gateway has been very slow.

  1. OPC tag changes take a very long time
  2. One shared script is very slow

I increased the memory for the gateway, but I don’t have any other ideas about what to do next.
Thanks for any other ideas.

Show us the very slow script. (Please use the pre-formatted text button so it will show indentation properly.)

def QADBF(serialNumber, partNumber, line): ##line = GEN3,FIVES or REPACK
	from java.util import Date
	from java.util import Calendar
	import datetime
	import system
	import time

	time.sleep(2)
	
	#VERIFY THE LETTERS WHICH ARE ALLOWED AND WHICH ARE ILLEGAL AND UPDATE THIS ARRAY
	alphabet = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
	
	
	now = datetime.datetime.now()
			
	creation = str(now)
	userName = 'SYSTEM'
	
	if line.upper() == 'REPACK':
	
		curIndx = system.db.runPrepQuery("SELECT TOP 1 QADINDEX FROM QADBFGEN3 WHERE QADINDEX like 'INR%' ORDER BY indx DESC",[],'OST_USERDEF')
		count = len(curIndx)
	
	
		if count < 1:
			QADINDEX = 'INRA0001'
		elif curIndx[0][0] == 'INRZ9999':
			QADINDEX = 'INRA0001'
		else:
			for i in range(len(alphabet)):
				if str(curIndx[count-1][0][3]) == str(alphabet[i]):
					if int(curIndx[count-1][0][4:])==9999:
						QADINDEX='INR'+ str(alphabet[i+1])+'0001'
						break
					else:
						QADINDEX='INR'+ str(alphabet[i])+str(int(curIndx[count-1][0][4:])+1).zfill(4)
						break
		
	else:
		
		typsql = "select * from Label_Attr where pn = ?"
		typ = system.db.runPrepQuery(typsql,[partNumber],'OST_USERDEF')
		
		prefix = 'INB'
		if typ.getRowCount()<>0:
			if (typ[0][0] == '12V40') or (line == 'GEN3'):
				prefix = 'ING'
		
		indexsql = 	"SELECT TOP 1 QADINDEX FROM QADBFGEN3 WHERE QADINDEX like '"+ prefix +"%' ORDER BY indx DESC"
		curIndx = system.db.runPrepQuery(indexsql,[],'OST_USERDEF')
		count = len(curIndx)
		
		
		if count < 1:
			QADINDEX = prefix + 'A0001'
		elif curIndx[0][0] == prefix + 'Z9999':
			QADINDEX = prefix + 'A0001'
		else:
			for i in range(len(alphabet)):
				if str(curIndx[count-1][0][3]) == str(alphabet[i]):
					if int(curIndx[count-1][0][4:])==9999:
						QADINDEX=prefix+ str(alphabet[i+1])+'0001'
						break
					else:
						QADINDEX=prefix + str(alphabet[i])+str(int(curIndx[count-1][0][4:])+1).zfill(4)
						break


	#backflush_datastream for flat file creation for delivery for QAD Backflush
	
	backflush_data= "rfrebkflo.p,"
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+="MES,"
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+="STARTFG,"
	backflush_data+= serialNumber +","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+= partNumber + ","
	backflush_data+=line + ","
	backflush_data+=","
	backflush_data+="1,"
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+= QADINDEX +","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+= line + ","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","#backflush_data+="ITZEBRA,"
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","	
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+=","
	backflush_data+="\n"
	
	
	oldserver = "qvara12ldb01"
	newserver = "sfr0eg6xdb01"
	
	server = "qxxxxxxxxb01"
	share = "prodshare"
	filePath = "ea******queue/" + QADINDEX
	domain = "o****d"
	user = "od*****mba"
	pwd = "**********"	
	
	
	
	system.net.writeFile(server,share,filePath,domain,user,pwd,backflush_data,1)
	system.db.runPrepUpdate("INSERT INTO QADBFGEN3 (QADINDEX,SERIAL,MODEL,CREATED,DATE,USERNAME,SAMPLEFILE,T_STAMP) VALUES(?,?,?,?,?,?,?,?)", [QADINDEX,serialNumber,partNumber,creation,datetime.datetime.now(),userName,backflush_data,datetime.datetime.now()],'OST_USERDEF')

{ Note: highlight everything pasted before using the preformatted text button. You can edit your post to move the marker to the beginning of your code. }

You have a time.sleep(2) call in your code. Don’t do this. Not with a sleep function. Not with an infinite loop. Just don’t. Record a timestamp in a memory tag, and use a timer event to run your code two seconds after that timestamp. Or a more complicated state machine if needed.
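A minimal sketch of that pattern (tag paths are placeholders, not your project’s):

# in the event that currently calls time.sleep(2): just record the request
system.tag.write("[default]QAD/RequestedAt", system.date.now())
system.tag.write("[default]QAD/Pending", True)

# in a gateway timer event running every second: do the work once 2 seconds have passed
if system.tag.read("[default]QAD/Pending").value:
	requestedAt = system.tag.read("[default]QAD/RequestedAt").value
	if system.date.secondsBetween(requestedAt, system.date.now()) >= 2:
		system.tag.write("[default]QAD/Pending", False)
		# ...run the work that used to follow the sleep...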

Also, don’t use python’s datetime objects. Use Ignition’s system.date.* functions or directly use java.util.Date.
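For example, the rough equivalent of the datetime lines in your script:

now = system.date.now()   # a java.util.Date; fine to pass to runPrepUpdate
creation = system.date.format(now, "yyyy-MM-dd HH:mm:ss")   # only if you truly need a string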

THX, I will try