FSQL Deadlock

chache · June 16, 2011, 1:50am

Hey All,

First off, I know FSQL is old school, but it’s not up to me what version of what we run

This afternoon we had a communication failure knock out many sites we have. All these sites have alarms configured, and they all generated the proper alarm except for 1 site. The FSQL log facility had only 1 entry in it, at the time of the failure. I’ve recreated the log here:

Error executing update query: Deadlock found when trying to get lock; try restarting transaction at FactorySQL.LocalDbOperationProvider.ExecuteParameterNonQuery(DBConnectionID ConnectionID, String Query, IDbDataParameter[] Parameters, Int32 Timeout) at FactorySQL.Alerting.EnhancedAlertLogTable.InsertChangeInDB(blah, blah) MySql.Data.MySqlClient.MySqlClient.MySqlException: Deadlock found when trying to get lock; try restarting transaction at MySql.Data.MySqlClient.PacketReader.CheckForError() at MySql.Data.MySqlClient.PacketReader.ReadHeader() at MySql.Data.MySqlClient.PacketReader.OpenPacket() at MySql.Data.MySqlClient.NativeDriver.ReadResult(blah, blah) at MySql.Data.MySqlClient.CommandResult.ReadNextResult() at MySql.Data.MySqlClient.CommandResult..ctor(blah...) at MySql.Data.MySqlClient.NativeDriver.SendQuery(blah..) at MySql.Data.MySqlClient.MySqlCommand.GetNextResultSet(bah) at MySql.Data.MySqlClient.MySqlCommand.Consume(bah) at MySql.Data.MySqlClient.MySqlCommand.ExecuteNonQuery(bah) at FactorySQL.LocalDBOperationProvider.ExecuteParameterNonQuery(blah) at FactorySQL.LocalDBOperationProvider.ExecuteParameterNonQuery(blah) at FactorySQL.Alerting.EnhancedAlertLogTable.InsertChangeInDb(bah)

It looks like the error could have come from MySQL, but the MySQL logs don’t show anything around that time.

Anyone have any ideas on how to track down the root cause and/or ensure that the issue won’t happen again? (It makes me look bad when obviously all the sites are down but there is still one green dot on the map )

As an aside, I couldn’t select the log text from the log viewer, so I couldn’t copy paste it here. Also, when I tried to Export To File the log, it just created a 3 line XML file with none of the logging information. Are there tricks to getting the log data out from the log viewer without retyping it all?

Thanks in advance!

chache · June 16, 2011, 1:51am

Oh yeah, we’re running FSQL 4.2.13.0.

Colby.Clegg · June 17, 2011, 10:34pm

Hi,

You’re right, that it is from MySQL. The problem is, though, that it’s not considered a “serious” error, so MySQL wouldn’t say anything. Between reading and writing, these types of deadlocks can occur, and the database detects them, throws an error, and keeps going. The database assumes that the program will just try again, which is too bad… because we can’t always do that (or simply don’t).

There are ways to track down what exactly is happening in Mysql, but I think I can guess: there are multiple indexes on that table, and you have several (many?) clients polling it for information. Since you’re using the data for display, I know you’re doing some sort of querying. Frequent querying, or unoptimized queries that take some time, can lead to many more of these deadlock errors when the other side tries to write.

Here are some things to look at:

What indexes are defined for the alert log table?
How many rows are in the table?
How frequently are you querying the table? How long are the queries taking?

I know, you’re probably saying to yourself “what the heck! I don’t know any of that!”, but here’s what you can do:

Use the mysql query browser or administrator, and find the alert log table (probably “alertlog”). Right click and select “edit”. This should bring up the table structure, and show you what indexes are defined.
Run “select count(*) from alertlog”, with the proper table name, of course.
Find the components that are bound to the database. Look at how they get their state. The will probably be set to “poll” at some rate- if it’s “relative”, it’s probably 1 or 2 seconds. Also, copy the query out and run it in the mysql query browser to see how long it takes.
In the MySQL administrator, go to the connections panel, and hit refresh a number of times. If you see any queries stick in the list for a few clicks, they’re taking too long.

Post back here what indexes you have, and what the queries are doing. Perhaps we can find a way to remove some indexes, or make them better so the queries run more quickly. You can also possibly reduce their polling rates.

As for the log viewer, it’s a little annoying, but you should be able to right click, “select all” and then right click and “copy”. You won’t see the highlight, but it works for me.

Good luck, and let me know what you find.

chache · June 17, 2011, 11:20pm

Hehehe… I’ve made that table a bit ‘messy’ to ‘speed it up,’ which is the problem most likely.

Basically we have a map with dots for sites, and the colour of the dot implies if there any unclear and/or unacknowledged alarms. The original problem was that FPMI would have to do several SQL calls per site, and the number of sites we’re monitoring is ever increasing. When your on the same network as the server, the map loads pretty quickly, but over a VPN the map screen started taking 30+ seconds to load.

My solution was to pre-compute some of that work. I added a table with the colour that each site should be, and added a trigger on the alarm table. When an alarm is generated (or updated, or deleted), the trigger calculates what the new colour for the site should be.

It’s worked very well so far: the map loads in about 5 seconds over the VPN, which is fairly acceptable. We’ve never encountered a situation where many alarms were generated simultaneously, so I’ve never seen any issues with the extra overhead of the trigger.

I guess my hands are tied - I’d have to choose between making the map screen load fast or making inserts/updates to the alarmlog fast. When I get a few minutes I’ll likely try benchmarking the DB to see what sort of overhead the trigger is causing. I’ll also reduce the polling rate for the dots on the map, that should alleviate some load too.

Is there a way to try and get FSQL to redo the transaction if it fails? When I first noticed the problem, I couldn’t get the alarm to trigger by stopping and restarting the group. I guess the alarm is only triggered on the transition from ‘good’ to ‘bad’ states, so starting in a bad state wouldn’t trigger an alarm. Also, is there possibly a way to get FSQL to poll sites in more of a round-robin style vs all together? The sites all generated alarms in groups (since the polling was set to 10 minutes) so we got about half the alarms at 12:39 and the other half at 12:49. Is there some configuration to spread out the load?

Thanks

Colby.Clegg · June 17, 2011, 11:50pm

I’m curious what the SQL statements you’re using to build the state look like. Perhaps you didn’t explain the full scope of what you’re doing, but if it’s really just checking how many alarms are active at each site, it seems like there should be a way to do it in just one query, meaning that it shouldn’t take much longer over the vpn than locally. And if that were the case, it should be possible to optimize the indexes so that the whole thing was faster.

Also, deadlocks can be triggered by accessing keys in a different order between queries. If you post the queries you’re running, we can compare them to what FSQL runs. Also, post the indexes that are defined for the table.

Triggers are a bit notorious for causing this type of problem. But, I suspect it all just comes down to optimizing the query that builds the state.

In regards to retrying the query, there isn’t really a way right now to get FactorySQL to do it. Not much consolation, I know, but Ignition would… the store and forward system there works differently.

Regards,