Ignition Startup Problem and Stops while running

fahadfrq · June 26, 2012, 10:29am

We have 2 servers which used to run FactorySQL for the past 2 years, We recently upgraded to Ignition. Following are the server specs

Windows 2003 R2 Standard Edition SP2
Quad Core 2.26 Xeon, 4 GB Ram
Java 6 Update 33
Ignition 7.5
Apart from Ignition, I have Rockwell’s FactoryTalk Gateway and FactoryTalk Service Platform Installed on the servers.
The Ignition project has 402 groups with around 40000 tags in Total
Ignition conencts with 4 SQL Server 2005 databases and only 1 FactoryTalk Gateway OPC Server

Now, since we upgraded to ignition, many a times, the ignition has stopped working with the following messages flooding the wrapper logs.

Failed to post the message to the thread. ThreadID = 3780, errorCode = 1816
com.jniwrapper.win32.LastErrorException: Not enough quota is available to process this command. (Refer to WraperLog1 attached to see the full message which is repeated 1000s of time)

While in this state, the gateway webpage doesnt open nor does it give an error, it just goes into waiting. No groups seem to be running and neither is any data being cached.

Suffice to say, it has been frustrating to run the same project upgraded from FactorySQL in ignition. There is sufficient C Drive space as well as I have monitored the server while this error is being logged into the wrapper logs and the JAVA.exe is cose to 350 MB memory and the remaining server is functioning fine. Please any help on this would be great.

Another thing is that similarly, many a times, when the server is restarted, the master will ask the backup node to become inactive but would not startup itself properly also or take too long to startup, which causes data loss. Is there anyway for the Master to only take control, after it has started up correctly? I was looking at the startup conenction allowance but from the manual, I dont think I really understand what effect it has. The backup becomes the master quickly because it is set to Warm backup but when the master is restarted, the backup becomes inactive after being “asked” by the master , even though the master hasnt started up correctly.

Refering to Wrapper Log, it stops at random messages. I am attaching WrapperLog2 which has a typical startup but pelase note that it stopped at 2012/06/26 13:19:05 ie. it wasnt becoming active and had made the backup inactive long ago, so I had to restart it again.

I had to setup a group to log time every minute and then had to write an app to monitor and force restart the server if it gets stuck. I upgraded Java6 update 31 to update 33 today and still on some restarts, ignition has failed to become running properly. My users are at their patience limit and I am thinking about going back to FactorySQL, since the project is the same which ran in FactorySQL fine for years.
wrapperLog2.log (33.9 KB)
wrapperLog1.log (3.02 KB)

Greg.Buehler · June 26, 2012, 1:27pm

At 2012/06/26 13:17:18 you have the OPC server returning an error code of 0xC004000C which means you’ve got duplicate tag or group names somewhere. This seems to be causing the OPCUA server to bomb out. Not sure if this might be an actual duplicate in your tag database, or is a symptom of another issue.

The message in wrapper1 “Not enough quota is available to process this command.” is basically telling you that you’re out of memory. I’d look at making sure Ignition is configured to use the proper amount of memory by editing ignition.conf from

# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=1024

to

# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=3584

If that doesn’t help, I’m guessing there is a memory leak.

dave.fogle · June 26, 2012, 5:27pm

If you're running 32bit Java then you won't be able to bump the maximum java heap size up in the ignition.conf file to much more than 1500MB. Attempting to do so will likely result in your gateway not being able to start. You could however run 64bit Java and 64bit Ignition which do not impose a limit on the the amount of memory you can allocate for the Java heap space. That being said I'll get some further clarification on whether or not this issue is indeed being caused by a lack of memory and what the necessary fix would be.

As for the issues regarding starting up of the master node in your redundant system, what exactly do you mean by the master not starting up correctly? Does the gateway page never become available? Does the connection status on the backup machine say that it's connected to the maser node?

Colby.Clegg · June 26, 2012, 5:58pm

Hi,

I suspect the duplicate name problem could be due to a previous subscription being maintained by the server, due to the disconnect being caused by other problems. In log 2, it hits that error, and then seems to connect ok, only to have the server machine restart 5 minutes later. Did it work for those 5 minutes?

We haven’t run into the subscription name problem before, but there are probably some things we can do to improve Ignition’s behavior in that situation.

Now, as for the quota errors… I think the problem here is possibly a deadlock occurring between Ignition and the OPC system, and not so much a memory problem. When this is happening, please go to Console>Threads in the gateway, and see if it reports a problem. If you see any threads marked as “blocked”, export the thread dump and post it.

Finally, for the redundancy switch over, when set to “automatic”, the master requests control during the connect procedure. There isn’t currently a delay that can be set to affect this. When this happens, the system is “running”, but as you’ve observed, it’s possible that some subsystems might still be starting. We can definitely look at putting in a delay setting, but perhaps in your case it would be better to set the master recovery mode to “Manual”, and go in and switch it by hand when you know the server is back up. I know it’s not the best solution, but it may be the best for now.

So, to summarize:

I’ll look into the opc error, and a setting for delaying switch back
For you, try to catch the threads when the system is reporting out of quota.

Regards,

Colby.Clegg · June 26, 2012, 11:36pm

Here’s a slightly modified version of the OPC-COM module that uses a different scheme for subscription names, that should hopefully eliminate that problem. You can upgrade from Configuration>Modules in the gateway. You shouldn’t need to restart the gateway, but it would be better to plan for a time when you could if necessary. When you upgrade, it should also send the module to the backup node.

Let me know if this helps. It’s for Ignition 7.5, that I saw you were running in the log file.

Regards,
OpcCom-module.modl (1.59 MB)

fahadfrq · June 27, 2012, 4:26am

thankyou for a detailed reply. I have updated the OPC Com module. Slightly reserved about the fact that it is a beta version and not available in the latest release but as long as it is stable enough, I am ok. My FactoryTalk gateway has various RSLinx Classic, RSLinx Enterprise and Kepware servers all created in unique named areas and the topics are different so that the qualified Item name should be unique. So I will attribute the duplicate tag name to old subscriptions in the OPC Com Module.

Regarding the Java memory, if there is a memory leak, wudnt even 4 gb be too little? is there any benchmarking that for system with this many Tags and Groups, a 64bit version is suggested?

Regarding switchover in redundancy. Say If i restart the master node, when the master comes back up, the backup will become “Not Active” again (on request from the Master) even though the Master has not started all the modules still and if for some reason fails to initialize all components properly, then no groups will be running. When the master is available agin, it should not ask for control from the backup unless all components have been initialized.

After the module installation, I had to restart the master server twice since It was getting stuck again. I am attaching the wrapper1.txt log. I found the following messages repeating

Tried to perform RPC operation on an unrecognized session. May indicate that the previous session was lost. Creating RPC listener for new session id.

Join-wait time has expired, will refresh system state.

During this time the webpage could not be opened also and the backup had already become “Not Active”

I restarted the backup also and then the master and then both became ok. Since then I tried restarting the master 4 times and it started correctly everytime . Did it have something to do with the new OPC Com module wanting a restart of both servers? I need the peace of mind that everytime ignition is restarted, it will initialize ok.

I am attaching the Wrapper2.txt which has the sequence of the last correct restart. Maybe you could check the behaviour of the new OPC Com module to see that It is working correctly and was indeed the Problem Fixer.
wrapper2.txt (894 KB)
Wrapper1.txt (31.9 KB)

Colby.Clegg · June 27, 2012, 6:50pm

Yes, I understand, but wanted to get something up that you could try. The only difference between that version and the released version is that the new one doesn't specify the subscription name- it lets the server generate it. This is defined by the OPC specification, so it shouldn't cause any problems, but I suppose that certain OPC servers might not implement it correctly. However, if your system is currently working ok, you won't have a problem.

The "duplicate" that we're talking about in this case isn't your fault, and it's not multiple tags. It is the "subscription name" which Ignition uses against the OPC server. I think that Ignition might have been reconnecting, and that the server may have been complaining that the subscription previously existed. The new version should fix this.

Again, I don't actually think the quota problem is really a memory problem. I think it's caused by a deadlock, which leads to an overflow in the COM buffer. We have to catch and fix the block when it happens.

As for needing to restart, since I can't see that part in the logs, I can't be a 100% sure. It's possible that something might have happened in the transfer of the module to the backup, which would cause the backup to download a full backup the next time the master started. I can't really imagine why you would need to restart the master more than once, though.

I agree with you that the master should wait to request responsibility until it knows the modules are started. We'll have to look at how to make this happen.

The "Tried to perform RPC..." error is something that you might get if the clients/designers were left running when the gateway was restarted. It shouldn't be a big deal, and should only show up once per client or so, but you say that it was happening a lot?

Finally, I saw that the log had a TON of messages about "Errored values were received for some items in the subscription". Do all of your values seem to come in and work correctly after the system is started? This message very well could just be due to the first values sent by the opc server, before real values are available. However, they're being logged because in addition to bad quality, an error code is being sent that means "Bad Datatype". I think it's saying that because it doesn't have the value yet, but I want to make sure the values eventually come in. There is an option in the OPC connection configuration for "request data type" which might help, if the problem doesn't correct itself after startup.

Regards,

fahadfrq · June 28, 2012, 3:08am

Well so far so good, the system is working fine. I have seen a remarkable improvement in success rate of groups which write to OPC Tags also since the upgrade. Thankyou so much for the help with the new module. I have about 10 ignition licenses intended for various plant area projects and I was kicking myself for suggesting upgrade from factorysql.

By “Restart”, i meant, that i had to manually restart the servers since none were active. I only copied the section in the log when it was stuck at the RPC warning error and no other activity was taking place. However, since the restart of both servers after the updated module, everything is OK. I had a planned downtime and needed to ensure that the behaviour was not repeating, hence I restarted the master server multiple times.

Yes please look at reducing the “down time” of Master taking back control from the backup. It seems too logical for it to be missed in such a integral functionality of the system. I do not want to use manual switchover since sometimes due to hardware specs of the backup server, Master should take control always if available.

Yes, the bad values come in the log file, only on startup and then do not occur again. I am ok with that but do you think that “request data type” would improve anything in the OPC cmmunication?

Also what is the recommendation about Java versions? is java 7 better or not, I read somewhere that ignition was compiled in Java 6 but didnt know the recommendations from inductive Automation about support of Java versions or Java Updates. Even though I know Java 7 will work with ignition but which is better?

Colby.Clegg · June 28, 2012, 4:53pm

Hi,

I'm glad to hear that things seem to be working better. Unfortunately the change from FactorySQL to Ignition is a major technological shift, but on the whole we haven't come across anything that would really prevent the transition. The performance of the SQL Bridge is equal to or better than FactorySQL in almost every way. The tricky part, unfortunately, is always with OPC... the technology (COM) and the nuances of each particular OPC Server. We've continued to test the OPC-COM module at the OPC Interop events as we did FactorySQL, but the down side is that generally only the best opc servers go the event to begin with

This is hard to say, because it could simply be something that the OPC server is going to send on startup no matter what. That is, the behavior is 100% defined by the particular opc server, and I would bet that all of these messages are just from one of them, while your other servers don't do it. It could simply be the server's way of sending a "null" value, which in general isn't really possible in OPC. So, if everything is good after startup, I wouldn't worry much.

what is the recommendation about Java versions?

I would recommend choosing Java 6 when possible. We now feel confident that Ignition works perfectly well with Java 7, however, we've found a number of changes in Java 7 that have affected performance a bit. In particular, launching clients/the designer can take considerably longer. We haven't seen anything that really affects the gateway, but overall, Java 6 has a longer and better track record for us.

Regards,

fahadfrq · July 1, 2012, 3:03am

After almost 3 days of working fine, there was a hang up issue again, in which the groups were not executing.

I am attaching the wrapper log from when the problem started to the moment of restart of the server.

I have a standard group (bypassing store and forward) which logs the current timestamp via an expression into a DB table every minute. I then monitor that table and if for the past 3 minutes no new data is logged, I force restart the master server. Wait 5 minutes after restart and then start monitoring again.

please go through the wrapper logs and see if you can get to whatever caused the problem.
wrapper.txt (186 KB)

fahadfrq · July 1, 2012, 3:46am

Apart from the one freeze mentioned above, I am attaching the full wrapper log file, there are 2 instances of “A fatal error has been detected by the java Runtime Environment”

at 30th june 2012 18:25
at 1st July 2012 00:46:55

this made the ignition service restart itself.
wrapper.zip (447 KB)

Colby.Clegg · July 2, 2012, 5:28pm

Hi,

There are a lot of different things going on there that I’ll need to look at. However, one thing that caught my eye: one of the errors was a divide by zero error, which seems to me to be something that would be a problem in the library that we rely on. Could you look on the system and see if you have the file “C:\PROGRA~1\INDUCT~1\Ignition\hs_err_pid2000.log”, and if so, post it here?

There are other problems that talk about “out of memory”, though it’s hard to say whether there is a memory leak, or just not enough memory for the operation at the time (like a big subscription update, or writing to many tags- though I can’t imagine it’s writing to too many at once).

I’m going to take a look at the async write system, which seems to be where the memory leak might be, if there is one.

On a different note, if you want to prevent the logging of all of those value warnings, you can find “log4j.properties” in “{InstallDir}\home\data”, and add the line:

log4j.logger.opccom.wrapper.subscriptions.COMSubscription=ERROR

This would limit logging to errors, which should be fine. I believe only those errored value messages use the WARN level.

Hopefully I can find something soon,

fahadfrq · July 3, 2012, 3:07am

I agree, it seems that at the time of hangup the system generates errors about Async OPC write failures.

Our system does rely on OPC writes but there arent that many since, just around 15 groups which may be collectively writing to 80-90 tags, most write 3 times a day only.

I am attaching the hs_err file, I have around 10 of those already in that folder but am sending you the latest ones. some have Access Violation and some have Divide by Zero, so am sending you the latest one of each.

Also this morning again, there was a hangup and my custom app had to restart the master server, just before hangup the some OPC write failure messages were logged which you can find in the wrapper log attached. at 5:11 AM the messages came and then at 5:14 AM the server was force restarted, no groups executed from 5:11 AM.

please look into it, since everytime this happens, i still loose about 3 minutes of important data and the users are more interested in the 3 minutes lost rather than the 1437 remaining minutes of the day but its their call.
hs_err_pid5204.log (16.8 KB)
wrapper.txt (8.74 KB)
hs_err_pid4136.log (17.3 KB)

fahadfrq · July 3, 2012, 9:31am

Ok so this morning at 10:38 AM, the master server stops executing groups, my custom app restarted it, it would come up and still not initialize and so kept on restarting…

at this time, even the bacup server was in “failed to post message, not enoughquota is available” error constantly.

the problem was resolved when I restarted the backup server and then the primary server. Suffice to say a very very bad situation.

i am attaching the thread dump also which had some blocks along with the wrapper logs of master and backup. in the master log, the restarts are due to the custom app.

we went into factorysql to replace RSSQL and FactoryTalk transaction manager and for more than an year factorysql perform superbly in repalcing and enhancing our datalogging requirements. With ignition, these problems of “unexplained” data loss of the same project are really difficult. please help.
backup wrapper.zip (84.9 KB)
wrapper.zip (293 KB)
thread-dump.txt (48 KB)

Colby.Clegg · July 3, 2012, 8:04pm

I’ve managed to replicate some strange behavior, but it will take me some time to investigate. I just want to let you know that I am actively working on this, and that it is a high priority. I’m fairly convinced that there is a deadlock somewhere in the OPC system, which is leading to the “not enough quota” issue. I’ve managed to make it happen a few times, so hopefully I can track it down soon.

Regards,

fahadfrq · July 4, 2012, 2:52am

thankyou for attending to this. if you need any additional logs or data, let me know.

Also please look into the errors on the master server i.e. when it stops the groups execution with Access violation or divide by zero error.

I have disabled the backup and made the master independent so that on force restart, the master server, does not depend on the state of the backup to initialize properly.