Gateway leaks database connections

Our Ignition gateway is randomly but infrequently deciding it only wants to use 1 database connection, even when MaxActive is set to 8.

If it were 2 or more we might not notice so much, but the single connection is consumed as soon as a transaction is opened. If the code running the transaction also tries to do something outside the transaction, or in a different transaction, it blocks until either a TransactionTimeout or a timeout waiting on the connection pool.
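To illustrate, here is a made-up sketch of the kind of pattern that gets stuck (the datasource name and queries are placeholders, not our actual code):

# Made-up example - with only 1 connection in the pool, the query outside
# the transaction blocks until a timeout fires.
txId = system.db.beginTransaction("mydb", timeout=30000)	# takes the only available connection
system.db.runPrepUpdate("UPDATE parts SET qty = qty - 1 WHERE id = 1", [], "mydb", txId)
count = system.db.runScalarQuery("SELECT COUNT(*) FROM parts", "mydb")	# outside the tx - needs a 2nd connection
system.db.commitTransaction(txId)
system.db.closeTransaction(txId)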

Whilst it is waiting for the timeout, the status for the database connection shows that 1/8 connections are used.

The problem can be worked around by going to the edit screen for that connection and pressing save (no settings need to be changed). It will then use all 8 connections again, but there is no way of knowing when it will next decide to misbehave…

Notice that this is connection-specific. We have several database connections defined in Ignition, using different databases on the same MySQL server. We have had the problem on every single one at some point, but they do not affect each other. If one is broken, others may or may not be broken. Fixing one does not fix another.

I originally put this down to some sort of migration issue (we had imported all gateway settings and projects from 7.2 into a fresh 7.5.1 installation), but this has now happened a couple of times since (over a period of about 2 months).

Ignition 7.5.2 gateway running on Ubuntu 12.04, OpenJDK 1.6.0_24.

I’m hoping someone is going to say that this is a known problem that has been fixed in 7.5.3 or 7.5.4.

Thanks.

You shouldn’t use the OpenJDK. You should really install Sun’s Java.

Could you provide some technical justification for that please, and how it relates to this bug?

I’ve just run a test on all the databases and they currently all have varying numbers of connections available. MaxConnections is set to 8 for all of them. It would seem that the pool is slowly losing them somewhere.

I don’t think this is an issue we currently know about/have fixed.

That said, we are shipping relatively old versions of both the MySQL JDBC Connector and of Apache commons-dbcp and commons-pool.

You can upgrade the MySQL connector to the latest version yourself. We will have to investigate upgrading the commons-dbcp and commons-pool libraries - it may be something that needs to wait until 7.6 to do safely.

Give upgrading your MySQL JDBC a shot and see if it helps though…

Problem persists with Ignition 7.5.4 (b1206).

I’ve just updated to mysql-connector-java 5.1.22, will see if that fixes it…

[quote=“systemparadox”]Problem persists with Ignition 7.5.4 (b1206).

I’ve just updated to mysql-connector-java 5.1.22, will see if that fixes it…[/quote]

It would persist - nothing has changed.

Perhaps a thread dump taken while the problem was occurring could be useful. I’d be interested to see if something is blocked in the connection pool’s “create internal connection” (or whatever it’s called) function.

If connections are only being retrieved through Ignition’s db manager (specifically, through GatewayAPI’s Datasource.getConnection), they should show up on the status page, under the ?/8 counter. Off the top of my head, I don’t know how you would get one otherwise, but if you bypassed that and somehow got straight to the pool, I guess you could get a connection and never close it, leading to this type of behavior. However, nothing in Ignition would do this. That is, we certainly have had problems with connections not being closed before, but they always show up on the status page.
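For example (a made-up bit of project script, names are placeholders), a transaction that is opened but never closed holds one pooled connection, and it shows up in that ?/8 count until it is closed or times out:

txId = system.db.beginTransaction("mydb")
system.db.runPrepUpdate("UPDATE parts SET qty = qty + 1 WHERE id = 1", [], "mydb", txId)
system.db.commitTransaction(txId)
system.db.closeTransaction(txId)	# omit this and the pooled connection stays checked out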

We had a problem earlier this week with SQL Server that was solved by upgrading the JDBC driver, so who knows, there could be something there. Let us know if the update seems to help.

Regards,

Problem persists with mysql-connector-java 5.1.22.

There appears to be some correlation between the rate at which connections are leaked and our development activity. I reset all the connections on Wednesday. On Thursday and Friday we only had one person doing minor fixes to a project we don't normally use. The database for that project has leaked a single connection, while our main project (which normally loses connections faster than the others) hasn't lost any.

This would also imply that, contrary to my original suspicions, the issue is not due to any kind of nightly backup activity.

Are you monitoring the connections through the MySQL Workbench? Did you try grabbing a thread dump while connections were blocked, to see if perhaps they were blocked inside of a particular function?

I’m not sure how well this would work in practice, but theoretically could be helpful (especially if the connection gets stuck fairly quickly): MySQL has a “general log” which will log every statement, along with the connection id. If you were to turn that on, and then watch the connection list until one of these lost connections occurred, you could note the id of the connection, and go to the log to see the last statement executed. This would help us figure out where the query was run from (that is, maybe the query would look familiar, either to you or to us).

The difficulty, of course, is that with so much activity that log could get big FAST. But it may be worth trying for a little bit. I linked to the docs above, but in short, you can add this to “my.ini” (if on Windows), under the “[mysqld]” section:

[mysqld]
...
log = "c:/path/to/general.log"
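If it helps, you could also watch the connection list from the Ignition script console with a quick query against the processlist (just a throwaway sketch; the datasource name is a placeholder):

# Throwaway sketch - prints the id, database, age and last statement of each
# current MySQL connection so you can spot the one that gets stuck.
rows = system.db.runQuery("SELECT id, db, time, info FROM information_schema.processlist", "mydb")
for row in rows:
	print row[0], row[1], row[2], row[3]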

Regards,

I don’t know what you mean by monitoring with MySQL Workbench, but SHOW PROCESSLIST shows that affected data sources are simply maintaining the wrong number of connections.

For example, my test script shows that one of the data sources will only use 7 of the 8 connections. SHOW PROCESSLIST shows only 7 active connections. If KILL CONNECTION is used, the data source recreates the killed connections, but only up to 7.

The thread dump is attached. If that doesn’t give any hints then I will see about enabling the general log.

Thanks.
thread-dump.txt (85.2 KB)

Ok, it really seems like this is an issue within the connection pooling software we use (Apache DBCP).

Since you’re able to reliably reproduce this issue, I wonder if you wouldn’t mind doing a bit of an experiment for us. (I tried it already, so I know it works. Whether or not it fixes your issue only you’ll be able to tell)

I want you to upgrade the Apache DBCP and Apache Pool libraries to their most current versions.
Download the DBCP 1.4 binaries here commons.apache.org/dbcp/download_dbcp.cgi
Download the Pool 1.6 binaries here commons.apache.org/pool/download_pool.cgi

Now, under your contexts/main/core/gateway folder, remove the commons-dbcp-1.2.2.jar and commons-pool-1.4.jar files. Replace them with the updated versions you downloaded.

You should do all this while the Gateway is shut down. Then start it back up and see how it works.

With the upgraded DBCP and Pool everything still works, but connections still appear to leak.

I note also that it’s lost a connection on every database this time, even those which haven’t been used.

Hmm…different but still broken huh. What’s the timeframe on these “leaking” connections? How quickly do they leak? How regular is it? How are you testing that a connection has leaked?

At this point we need to start identifying differences between your system and every other customer’s system that isn’t displaying this behavior. Do you have any modules written that interact with the database connections?

I suspect the different results are not due to the upgrade; I just don't have enough data available yet. This will be much easier if I can find a way of triggering this manually.

VERY approximately one connection is lost every two days. Sometimes every data source loses one, sometimes only one or two data sources lose one. At this point I am unable to see any patterns. Every time I identify a possible pattern, the next day I see a counter-example :frowning:

We have 3 custom modules on the system and all of them interact solely with script modules, never with the database.

The number of available connections is tested using the following hack:

databases = [
	'foo',
	'bar',
]
results = {}
timeout = 5500
for db in databases:
	i = 0
	try:
		# Deliberately open transactions without closing them, so each one
		# holds a pooled connection until the timeout fires.
		while i < 8:
			system.db.beginTransaction(db, timeout=timeout)
			i += 1
	except:
		# Pool exhausted (or timed out) before reaching 8.
		pass
	finally:
		results[db] = i
system.gui.warningBox("<html>Results:" + ''.join(["<br>%s: %d" % (db, i) for db, i in results.iteritems()]))

Yes, this blocks the main thread and consumes all the connections until the timeout is triggered. It’s horrible.

The results of this are confirmed by:

SELECT db, substring_index(host, ':', 1) AS hostname, COUNT(db) FROM information_schema.processlist GROUP BY db, hostname ORDER BY db, hostname;

I’ve turned on the idle connection evictor for one of the data sources and cranked it up to genocidal to see if that makes any difference.

I have also noticed that saving the settings for a data source does not release any connections it already has. For example, if a data source is holding 8 idle connections in the pool, pressing save creates a brand new connection, bringing the total in the process list to 9. Running the test script to exhaust the pool causes it to create another 7, bringing the total to 16, of which only 8 are used. The original 8 have simply been forgotten about. Restarting the gateway causes all these connections to finally disappear. I temporarily reverted DBCP and Pool but this behaviour remained.

This issue also occurs on my local development gateway installation on Ubuntu 12.10, so it is not specific to our Ubuntu 12.04 server.

My local gateway started up at 8:30 and when I checked it at 11:55 it had lost one connection on every data source even though no clients or designers had been started.

I have also noticed that data sources with the idle connection evictor switched off still have their connections cleaned up eventually (either after some time or when all designers/clients are shut down; I haven't worked out which yet). Maybe this is related?

Somehow Ignition is clearing active connections once everyone has logged out… of their window manager! Closing all clients and designers and all web browsers doesn’t trigger it, but the second I log out it cleans them up. I compared the process list before loading the client and after it is shut down but there are no new processes left that could be hanging on to a gateway connection. Wireshark shows that port 8088 is all quiet after the client is shutdown.

How does the gateway know?

Even data sources with initial_size and min_active set to 8 (rather than the default 0) get reduced to a single connection. These are not restored when a client connects. They are only recreated when there is demand for them, exactly like data sources with min_active set to 0.

Data sources with the connection evictor switched on leak less, but do still leak.

Right, I’ve found the problem.
KDE 4.2 uses MySQL as its default storage backend. Normally this uses the local server, but I have a ~/.my.cnf set up to connect to the remote server automatically, and KDE would appear to be using that (which is not helpful). Upon logging out, something goes wrong and causes the MySQL server to crash.

However, it still shouldn’t be leaking.

So there are 3 bugs here:

  1. (probably) KDE is doing something stupid with the MySQL connection
  2. The MySQL server is crashing instead of closing the broken connection
  3. Ignition (or more likely Apache DBCP/Pool or Connector/J) is not handling the crash correctly and loses track of how many connections it actually has. It also fails to recreate connections to satisfy the min_idle setting.

KDE already has a bug open for incorrect use of my.cnf: https://bugs.kde.org/show_bug.cgi?id=264383.

Wow, that came out of left field.

I wonder if it’s Connector/J or DBCP that is at fault here… maybe there are some DBCP loggers we can inspect to shed some more light on the issue.

Some further investigation reveals that, since KDE thinks it's using its own server, it issues a shutdown command. So there is no bug in MySQL or in KDE (other than using the wrong server entirely).

I can reliably reproduce this problem by doing a “mysqladmin shutdown” or “service mysql restart”.

It is clear that Ignition (or Pool/DBCP/Connector/J) really can't cope with the MySQL server going down for any reason, even when it is shut down gracefully. Sometimes some of the data sources don't come back online at all without manual intervention.
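For anyone else trying to reproduce this, a quick way to check which data sources have come back after the restart is a throwaway script-console loop like this (datasource names are placeholders):

# Made-up check - a trivial query per datasource; a failure means the
# datasource has not recovered after the MySQL restart.
for db in ['foo', 'bar']:
	try:
		system.db.runScalarQuery("SELECT 1", db)
		print db, "OK"
	except:
		print db, "still faulted"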