Database Connection Failing with High CPU Usage

We have a problem where the database connection keeps faulting. Existing graphs fail to load, and if left on screen showing “Loading”, the connection faults with “Cannot get a connection, pool error Timeout waiting for idle object”. The alarm history does the same thing. When this happens we also see “Gateway Connection Lost” popping up on the screen.
We are running 7.4.1 and this is not a large project.
The gateway has also restarted itself.
The CPU goes up into the 90% range, and RAM usage is currently at 1003 MB.
I noticed that if a graph is opened and then closed, another “Active Connection” appears in Database Connections, and these can have an “Age” of 20 or 30 minutes.
The server has 250 GB of hard drive space available.

Any ideas on where I should be looking to track down the fault? :scratch:

Looking at the logs and console could give you an indication of what is going on. Have you tried restarting the gateway?

Can you post the text of the long-running query? When the system is under high load like that, go over to the active queries page and enable live values. If your connection settings are the default, there will be 8 queries running at once. Taking a screenshot of the long-running queries might help us track it down.

Regards,

I discovered that the SQL table used by the graph for ‘Chart Configuration - Tag Pens’ had a duplicate/identical row (the table has no primary key), and I believe this caused the issue.

After the SQL table was altered, two connections did re-appear about an hour later, along with a SQL connection lost then restored message. One stayed for an hour; the second disappeared after a couple more hours. (Screenshot attached.)

We had been seeing these lost SQL database connections before this issue, but could the above have been a hangover from the noob SQL table mistake?


It’s most likely that the missing index was the source of the slow queries. When there’s no index to use, the database has to perform full table scans to satisfy the ‘where’ clause. However, keep an eye out for any more issues and let us know if you continue to have problems.
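As a rough illustration only (the table and column names below are placeholders, and the syntax is MySQL-style, so adjust it for whatever database you’re actually using), adding a surrogate primary key and an index on the column the chart’s ‘where’ clause filters on would look something like:

ALTER TABLE tag_pens ADD COLUMN id INT AUTO_INCREMENT PRIMARY KEY;
CREATE INDEX idx_tag_pens_chart ON tag_pens (chart_name);

The primary key gives every row a unique identifier (so a stray duplicate row can be deleted cleanly), and the index lets the database satisfy the ‘where’ clause without scanning the whole table.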

Hi,

Those queries are part of the tag history system, reading the data for a query. I suspect that, one way or another, queries are being fired off for time ranges that you might not expect. To confirm this, you can go to Logging and set the “History.SQLTags” logger to Debug (note: when you set it, it will also set the sub-loggers to Debug; those you can set back to Info). This will log the parameters of each query that comes in. Then, after one of these long-running queries happens, you can take a look at the log and paste it here.

I believe there was some issue with the easy chart firing unnecessary query requests on open, but I think that was in 7.3, not 7.4. Still, you might try updating to the latest 7.4.

Also, what type of database are you using? There’s no reason those queries should take that long to begin with, so maybe we can troubleshoot on the db side as well.

Regards,

Also, just a side note from the band-aid category: if you go into your database connection and increase the number of allowed connections (the default is 8; you can easily put it at 25 or so), you’ll reduce the errors you get from components trying to run queries when the connection pool is full. I’m not 100% sure these are the errors you’re seeing, but it shouldn’t hurt much to increase the pool.

Regards,

To follow this up: the long-running ‘Active Connections’ have gone and not returned, but we noticed the CPU was running at a high percentage.
We also had some new devices which refused to connect or browse.
After a gateway restart, the System console was full of ‘java.lang.Exception: Unable to add data to memory buffer, buffer full’ faults.
I also noticed in ‘Configure - Store and Forward’ an ‘Unable To Create Table’ fault referring to a tag history table from 6 months ago.
Another post mentioned renaming the datacache folder, so this was done; now the CPU is <1%, everything looks OK, and all the new devices have connected.

Any advice on what caused all of this would be appreciated. We were fortunate that none of the data is essential yet, but I guess we need to learn how to avoid these problems.

Hi,

For some reason, data was being stored to the data cache. The cause may have been temporary or not, but either way the data would have been quarantined. If the cause wasn’t temporary, quite a bit of data might have been stored over time. Unfortunately, we’ve observed that as the cache accumulates more data, it takes longer to pull data out. I suspect that you got into a situation where even good data was first going through the cache, resulting in a lot of CPU use.

The best way to avoid this would be to reduce the max size for the cache, and to keep an eye on it to make sure it doesn’t get too big. One thing that can help is to set up SQLTags and alerts that watch the quarantine size. That data should be part of the gateway status tags, but I guess it’s currently not. However, there is an expression function that you can use:

getQuarantinedRowCount("DbConn")

(There’s also “getCachedRowCount”)

So, you can create an expression tag that runs that, and alarm or show a message when it’s greater than 0, or some limit.
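For example (assuming your history connection really is named “DbConn” as in the snippet above), the expression on such a tag could be as simple as:

getQuarantinedRowCount("DbConn") > 0

which evaluates to true whenever anything is sitting in the quarantine, and that boolean can then drive an alert or an on-screen message.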

Hope this helps,