MSSQL 2005 & Data Quarantine

I had an issue that I wanted to put out to the community to see if others have seen it and how they handled it.

Running Ignition 7.1.5 on a Win2003 Server in a VMWare VSphere setup.

MS SQL 2005 running on hardware in Windows 2008 Server 64bit.

I have several projects running. 102 transaction groups (all but 20 of them as Historical Groups). a total of 524 OPC tags. All of these groups are either logging on a trigger (maybe once every 15 mins), or on 10, 30 second or 1 min triggers. I have a group of alarms that are only using alerts. In reality, there isn’t much going on here in terms of pounding anything.

I had an issue with my SQL log file filling up and running out of space the other day. This stopped allowing anything to be logged to the DB. This was my fault as the DB was set to Full recovery mode rather than Simple. This has been fixed.

But

For the duration that the database was down and unavailable (approximately 12 hours), the store and forward system cached and then quarantined all the data being logged. Once the database came back online, it was only partially logging anything to the database. Most unfortunately, I was out of town and couldn’t do anything about it. My IT guy changed the recovery mode to get it back up.

The system ran this way for 2 days. When I was finally able to get online with the system 2 days later, I found over 80K of records in the Quarantine.

I went into the system and told everything to retry. It was queued up as pending transactions in the local cache. This was 6pm on Wed. At 5am Thursday, there was still 80K of records pending there. Also, my server running ignition was bogged down to the point where I almost couldn’t look at task manager to see what was going on. Eventually, I found that the server was running between 60 & 90% of the CPU and java.exe memory had 1.2GB allocated.

At that point, I went into ignition and took all my transactions offline as I absolutely had to get the pending transaction in the database. At that point, it started inserting them at a rate of about 10K per hour and it sped up so that by 10am, the local cache was empty.

As a last item, java.exe memory usage was still at 1.2gb and it took stopping and restarting the gateway to release it.

This doesn’t make me feel real comfortable, to say the least.

My 1st question is why did 80K valid transactions get dumped into quarantine? I would expect that if it were indeed bad information, it would have just gotten dumped back into quarantine. It didn’t. I would also expect that this data would have been held in the local cache until the database came back online and then automatically would have been inserted into the database. Again, it didn’t.

2nd, why did emptying the local cache suck up as much memory and cpu usage as it did?

All of my store and forward settings were at defaults except for the max records which I set to 500K a few weeks ago.

Any thoughts would be appreciated as this is just a small part of the data collection system and it will be expanded as time goes on.

Thanks…

Hi,

Yes, that certainly doesn’t sound very good. First, I’ll give my explanation, and then talk about some points where we can maybe improve things in the future.

First, why the data was quarantined: The problem is, when SQL Server is in that state, the connection is fine, but the queries error out. Ignition can’t tell what kind of error is happening in the database- does the table not exist? Bad data? Or… out of disk space. So, after trying several times, it determines that the query should be quarantined.

Next, why the quarantine load was slow: Data is cached in an internal database. It has to be read out of the database and forwarded- however, as new data is generated, it continues to get written into the internal db. I suspect that this leads to a lot of inefficient locking, resulting in both operations taking more time. This would explain why the data loaded faster after stopping the groups generating data.

Unfortunately, as for the memory, I don’t really have a good explanation. We’ll have to try to investigate more here on our side.

Now, how things can be improved:

  1. There isn’t a quick fix for the quarantine situation. If the query errors out, we just can’t really know why (if we were only programming against one database it would be easier). The best thing to do would be to make the store and forward engine more analytical, looking at the type of transaction being performed, and quarantining “transaction types” instead of actual queries. Then it could periodically try to rerun the first transaction of each type, automatically unquarantining that data when appropriate. The quarantine is really intended for situations that require manual intervention, such as a missing table. It is, of course, our ultimate goal to avoid requiring intervention when it’s not necessary.

  2. We will look into performance characteristics when storing and loading concurrently to/from the cache.

  3. Something that you can change yourself to make things a bit faster when loading from the cache: set the Forward Settings “write size” to something higher, like 250 or even 1000. This number dictates when the rows should be sent (along with the “write time”)- but it also acts as a hint for how many pieces of data to include in the transaction. Your 10k/hour number would indicate an average transaction cost of 150ms, with the default size of 25. This is probably more or less right (still a bit slow compared to all of the tests we’ve run, but not horribly). The thing is, you could probably fit 100 records into the same transaction, making the insert 4x faster.

Hopefully this answers some of your questions. The store and forward system has, on the whole, performed well so far from what we’ve seen, but we also have lots of plans for it. There are many customers planning/building big hub and spoke data collection systems, so having the store and forward system be highly efficient is a top priority. I suspect the improvements mentioned in #1 above would be made along the 7.2 timeframe (early Q4).

Regards,

Thanks Colby…

1st, let me say this… on the store and forward… while it required my manual intervention, it did exactly what it needed to do… it saved me from losing all the data generated while the database was off line… we have 2 shift reports that we print off automatically… once the data was finally in the database, I had it all and was able to just manually print off the reports that were missed… so really, it saved my butt…

I think one thing short term that would be beneficial would be a ‘retry all’ in the quarantine section. That way, I could at least do it with 1 button press. Also, when looking at the quarantine, if the information is longer than the preset width of the window that ignition generates, I must page all the way down, move it over with the scrollbar and then click on retry. This is a real pain. Same thing happens throughout the Ignition configuration windows (i.e. the project screen, etc). Is there anything you guys can do about this?

On the determination of my specific SQL problem and why things would be sent to quarantine (and I don’t mean to sound flippant or anything)… 1st, it’s pretty easy to tell when something is malformed or incorrect on my end. During development, it either works or it doesn’t, so I know when I have done something wrong with my queries and configuration…

2nd, when I think about the problems I could have with the SQL server… well, it was there and was working at one point, so either the computer it was running on got turned off or crashed, or the Database itself is in some state where it will not accept data. In MSSQL 2005 at least, I can query the master DB sys.databases view and it will tell me what the state of the DB is. In my case, it was ‘LOG_WAITING’ error code 2 which translates out to a full log file. Obviously, I will need to put something in my apps that lets me know the state of the DB, but how possible is it to provide a hook into the store and forward that would allow us to say ‘don’t quarantine cause we know what the problem is’, i.e., the log is full and once it gets taken care of, just start forwarding everything you collected back into it rather than making me do it manually. Just a thought…

Now, on the store and forward settings, I found this post:

http://www.inductiveautomation.com/forum/viewtopic.php?f=72&t=5422&p=13542&hilit=store+forward#p13542

and set it up as you described. We will see how that goes.

The memory and CPU usage is really concerning me. If this is happening with such a small amount of tags and only a couple of people accessing things, I’m going to be in trouble when I really start using the system. I’ll be honest and say that I am absolutely no java fan but only tolerate it because I like Ignition so much…lol… but if there is anything you can do to look into it, it would make me happy. If you need to get into my system to see what’s going on, just let me know and I’ll make it available to you.

Thanks for responding…

Hi,

Yes, we should definitely make the quarantine page easier to view/use. We’ve noticed several parts of the gateway that are difficult to use when there is lots of data, or particularly wide descriptions, and should get that fixed soon.

In regards to the “quarantine or not” decision: I left a lot out of my description of the potential upgrades to the store and forward system. When I started writing it, I originally wrote that it could be more analytical, and realize that queries that were previously working are now failing, and likely shouldn’t be quarantined. However, this case is still not 100% true- sometimes it’s an issue of the data instead of the query- writing a value that’s out of range for the column, for example. I think there’s a lot of progress that can be made, but there’s probably a lot of ins and outs. Providing some way to tie in your own logic to determine if the database can accept data might be worthwhile option.

Finally, the CPU/Memory usage - your last post made it seem that the high usage was only during/as a result of un-quarantining the data. Is it taking a lot of memory/cpu during normal operation? 524 tags and 102 groups running between 10 seconds to 1 min really shouldn’t take too much memory or CPU. We definitely can take a look a your system if you want, and see if the normal usage makes sense or is way off for some reason.

Regards,