Postgres connection pooling issue

Very regularly we see our charts flash red, visually highlighting a connection problem…a second later it resolves. The longer the uptime, the higher the frequency…it has been a real PITA for a while as there seemed to be no rhyme or reason to it…restarting Ignition makes it go away for a while, but then it comes back.

Sure enough, the logs show “Error executing Historical Tag Read” matching the chart issues. There are other errors and symptoms as well…but this, as it turns out, is enough to quantify the problem.

Have now done a whole lot of tests and have managed to narrow it down to the point where I can reproduce it at will.

  1. We are running 7.9.0.
  2. I have already updated the JDBC driver to the latest, just in case (no change in behaviour, but as our DB was a later version than the JDBC driver, probably prudent).
  3. Our connection pool is set to unlimited, but in reality it never runs >15 connections, and typically <5.
  4. If you restart the DB connection, all is good…no gremlins, everything as expected.
  5. If you run SQL that fails (I have some scripts that I am debugging in the Designer, for example), the problem starts.
  6. If I restart the DB connection, the problem goes away.

So it seems that defunct connections are being retained in the pool as valid connections, so every time a defunct connection is selected from the pool the SQL fails (hence the intermittency).

This can also be demonstrated by the large number of quarantined events in the store-and-forward system…As it turns out, you cannot re-run the quarantine successfully while the bug is present…every so often a query hits a bad connection and adds to the list.

If you reset the database connection, however (simply opening its settings in the web interface and re-saving does the trick), you can then clear the quarantine successfully.

It also explains why we sometimes see spotty data collection…e.g. instead of getting nice lines in the charts we see sporadic missing data, which gets worse over time (likely as more and more connections become defunct)…it probably builds up in the store-and-forward system…that was maxed out when I looked at it.

One would love to see a solution to this.

I may have found a workaround…

I have engaged the evictor thread to run every 500 ms with test-while-idle…when the fault occurs, it now appears to clear immediately.
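For reference, since the pool underneath is Apache Commons DBCP (per the reply below from Inductive Automation), the settings involved correspond to standard DBCP pool properties. A sketch of the relevant values (the exact labels in the gateway's advanced connection settings may differ):

```
# Apache Commons DBCP pool properties (names from DBCP itself; the
# gateway's advanced connection settings expose equivalents)
testOnBorrow=true                      # validate a connection before handing it out
testWhileIdle=true                     # validate idle connections in the background
timeBetweenEvictionRunsMillis=500      # run the evictor thread every 500 ms
validationQuery=SELECT 1               # the default validation query
```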

I can now clear quarantined items without having to restart the DB connection, which is a very positive sign. Still early days (literally 10 minutes), but signs are looking good.


This is actually a critical bug that needs urgent attention…when you lose data as a result, it is far from ideal. The chart below shows the effect of the bug on the data.

The evictor only mitigates the issue; the root cause remains.

Are you using “Test on Borrow” in your connection, and if you log statements in pg, can you see these tests succeed?
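(To turn on statement logging on the pg side, the standard setting is `log_statement` in postgresql.conf, which takes effect on a config reload; with it set to `'all'`, the pool's validation queries show up in the server log alongside real queries.)

```
# postgresql.conf -- log every statement, including the pool's
# "SELECT 1" validation queries; reload with: SELECT pg_reload_conf();
log_statement = 'all'
```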

Test on Borrow is on.

Tests appear to succeed while the problem is extant.

Am considering using pgpool-II in order to drop the default Postgres pooling, which is documented as having this potential issue.


Ignition uses Apache Commons DBCP for connection pooling, not the PostgreSQL JDBC implementation.


What kind of errors occur when this fails? Any messages from the DB might be useful. The default validation query is just “SELECT 1”; I’m not sure why it would succeed when other queries wouldn’t, but perhaps you could try changing it to something that actually references your schema. Or is it that SELECTs work but updates/inserts don’t?
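As an illustration of why a bare “SELECT 1” can validate a connection that still cannot run real queries, here is a small sketch using Python's stdlib sqlite3 as a stand-in for two pooled connections (the connection and table names are hypothetical):

```python
import sqlite3

# Two "pooled" connections: one whose default namespace contains the
# expected table, and one that has gone bad and no longer sees it.
good = sqlite3.connect(":memory:")
good.execute("CREATE TABLE sqlt_data (v INTEGER)")
bad = sqlite3.connect(":memory:")  # hypothetical defunct connection

# The pool's validation query passes on BOTH connections...
for conn in (good, bad):
    assert conn.execute("SELECT 1").fetchone() == (1,)

# ...but a real, unqualified query only works on the good one.
good.execute("SELECT * FROM sqlt_data")
try:
    bad.execute("SELECT * FROM sqlt_data")
except sqlite3.OperationalError as e:
    print("validation passed but real query failed:", e)
```

A validation query that references the actual schema would make the bad connection fail validation and get pruned from the pool.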

I believe I may have finally found the cause of this issue…it was a schema tangle.

Not all of our SQL uses the full schema.table syntax…It looks like some connections were being released back to the pool defaulting to a different schema, which then causes previously working, non-fully-qualified queries to fail when they reuse those connections. Since a “SELECT 1” will still pass validation, these connections are never pruned, creating the persistent, intermittent behaviour observed.
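In Postgres terms, the mechanism can be sketched with search_path (the schema and table names here are hypothetical):

```sql
-- A pooled connection left with a non-default search_path,
-- e.g. set by a debug script and never reset:
SET search_path TO staging;

-- An unqualified query now resolves against 'staging', not 'public':
SELECT * FROM history;            -- fails, or hits the wrong table

-- Validation still passes, so the pool keeps the connection:
SELECT 1;

-- A fully qualified query is immune to the connection's search_path:
SELECT * FROM public.history;
```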

After fixing the schema issue I now have a few days of logs without these errors.

It is probably a good idea to standardise on fully qualified queries going forward, since these make no assumptions about the default schema (most queries already are, but it appears there were exceptions).