Best practice csv to SQL import

Hello-

I have a script that opens a .csv with somewhere between 5,000 and 10,000 rows and inserts the data into a SQL table. It works just fine but runs fairly slowly (I think): the script takes about 15-20 seconds to execute. I’m wondering what best practices I should use to make this run faster, if 15 seconds is much slower than you would expect that kind of task to take? I imagine the issue is calling system.db.runPrepUpdate 5,000 to 10,000 times? Is there a better way to get all this data inserted at once?

Here’s a simplified version:

import csv

# get file path of upload and then convert csv data to dataset
csvPath = event.source.parent.parent.getComponent('Container').getComponent('File Explorer').selectedPath

# open selected csv file
csvFile = open(csvPath, 'rU')
print csvFile

# read selected csv file
csvRead = csv.reader(csvFile)
csvDataset = []
for row in csvRead:
	csvDataset.append(row)
csvFile.close()

# get upload time
uploadTime = system.date.now()

# start db transaction
txID = system.db.beginTransaction('Quality', timeout=10000)

# delete all orders from table
query = "TRUNCATE TABLE inventory_dump"
system.db.runUpdateQuery(query, 'Quality', tx=txID)

# get inventory timestamp
inventory_timestamp = event.source.parent.parent.getComponent('Container').getComponent('inventoryDate').date

for row in range(len(csvDataset)):
	if row > 0: # ignore header row
		item_number = csvDataset[row][0]
		quantity = csvDataset[row][3]

		# insert orders into sql table
		query = "INSERT INTO inventory_dump (dump_timestamp, item_number, quantity, uploadTime) \
				 VALUES (?,?,?,?)"
		args = [inventory_timestamp, item_number, quantity, uploadTime]
		system.db.runPrepUpdate(query, args, 'Quality', tx=txID)

Something like this would probably be a lot faster: just bundle the INSERTs into a single statement. A further improvement would be to ‘batch’ those inserts into chunks of 1000 or so; see the grouper function in the recipes section of the itertools documentation:
https://docs.python.org/2/library/itertools.html
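For reference, here’s a minimal chunking helper in the spirit of that grouper recipe (this plain-slicing version is my own sketch, not the itertools version; it works on any list and avoids the fill-value handling of the recipe):

```python
def chunked(rows, size):
	"""Yield successive slices of at most `size` items from `rows`."""
	for start in range(0, len(rows), size):
		yield rows[start:start + size]

# e.g. 2500 rows in batches of 1000 -> three batches of 1000, 1000, 500
batches = list(chunked(list(range(2500)), 1000))
```

Each batch would then feed one multi-value INSERT instead of one statement per row.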

import csv

# get upload time
uploadTime = system.date.now()

# start db transaction
txID = system.db.beginTransaction('Quality', timeout=10000)

# delete all orders from table
query = "TRUNCATE TABLE inventory_dump"
system.db.runUpdateQuery(query, 'Quality', tx=txID)

# get inventory timestamp
inventory_timestamp = event.source.parent.parent.getComponent('Container').getComponent('inventoryDate').date

# get file path of upload and then convert csv data to dataset
csvPath = event.source.parent.parent.getComponent('Container').getComponent('File Explorer').selectedPath

query = "INSERT INTO inventory_dump (dump_timestamp, item_number, quantity, uploadTime) VALUES {values}"

# open selected csv file
csvFile = open(csvPath, 'rU')
print csvFile

# read selected csv file
csvRead = csv.reader(csvFile)
next(csvRead) # skip header row

args = []
values = []

for row in csvRead:
	item_number = row[0]
	quantity = row[3]

	# insert orders into sql table
	values.append("(?,?,?,?)")
	args.extend([inventory_timestamp, item_number, quantity, uploadTime])

csvFile.close()

system.db.runPrepUpdate(query.format(values=", ".join(values)), args, 'Quality', tx=txID)
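To make the shape of the final statement concrete, here is the placeholder/argument assembly on its own, with a couple of hypothetical rows standing in for the CSV data and string stand-ins for the timestamps (no database call, just the string building):

```python
uploadTime = "2020-01-01 00:00:00"           # stand-in for system.date.now()
inventory_timestamp = "2019-12-31 23:00:00"  # stand-in for the component's date

rows = [["A100", 1, 2, 5], ["B200", 3, 4, 7]]  # hypothetical CSV rows

query = "INSERT INTO inventory_dump (dump_timestamp, item_number, quantity, uploadTime) VALUES {values}"
values = []
args = []
for row in rows:
	values.append("(?,?,?,?)")  # one placeholder group per row
	args.extend([inventory_timestamp, row[0], row[3], uploadTime])

final = query.format(values=", ".join(values))
# final ends with: VALUES (?,?,?,?), (?,?,?,?)  and len(args) == 8
```

The key invariant is that the number of `?` placeholders in the query always matches the length of `args`.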

If you want to push extra hard on your DB, batch a thousand rows at a time like Paul suggests, but push the runPrepUpdate() into an asynchronous thread for each batch. Spawn them inside your CSV read loop as soon as you have 1000 rows. Your DB will parallelize the inserts as best it can.
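A rough sketch of that spawn-per-batch pattern, using Python’s threading module as a stand-in for Ignition’s system.util.invokeAsynchronous and a dummy function in place of the real runPrepUpdate call (batch size and names are illustrative):

```python
import threading

def insert_batch(batch, results, lock):
	"""Stand-in for the real insert; just records the batch size."""
	with lock:
		results.append(len(batch))

rows = list(range(2500))  # pretend CSV rows
batch, threads, results = [], [], []
lock = threading.Lock()

for row in rows:
	batch.append(row)
	if len(batch) == 1000:  # spawn a worker as soon as a full batch accumulates
		t = threading.Thread(target=insert_batch, args=(batch, results, lock))
		t.start()
		threads.append(t)
		batch = []  # rebind; the worker keeps its own reference

if batch:  # don't forget the final partial batch
	t = threading.Thread(target=insert_batch, args=(batch, results, lock))
	t.start()
	threads.append(t)

for t in threads:
	t.join()
# results now holds one entry per batch: 1000, 1000 and 500
```

In real Ignition code each thread would need its own statement/transaction handling (a single tx isn’t typically shared across threads), so treat this as a shape sketch rather than a drop-in.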

Thanks, guys, I’ll test that out. But if I’m catching what you’re saying, maybe it is better to use a single client’s resources for 15 seconds vs possibly straining the DB which is a common resource for many clients?

Well, databases are really good at batching. You can still compromise: batches of 1000, but serially invoked by the client (i.e., no asynchronous threads). Easier programming for you, still likely to be much faster (probably close to instantaneous), and actually easier on the DB (multi-value insert statements are easily optimized, e.g. when building indexes/caches/etc.).

Can you guys help me with this part? I’m still on Ignition 7 (Python 2.5), so format() is not available. Sorry, I should have specified earlier.

query.format(values=", ".join(values))

This seems to work, but I’m curious if this would be best practice:

values.append("?,?,?,?")
args.extend([inventory_timestamp, item_number, quantity, uploadTime])


valueString = ", ".join(values)
query = "INSERT INTO inventory_dump (dump_timestamp, item_number, quantity, uploadTime) \
         VALUES %s" % valueString

", ".join(sequence) is Python for "for each string in this sequence, mash them together with ", " in between"; so:

l = ["?,?,?,?", "?,?,?,?"]
print ", ".join(l)
# prints: ?,?,?,?, ?,?,?,?

So, your replacement does seem fine; I wasn’t using any of the more advanced features of .format(), I just find its string-replacement syntax a little clearer to read than the %s placeholders.

Looking at it again, though, I think you’ll still need the parentheses around each set of values, so
values.append("(?,?,?,?)")
Such that the final query looks like:

INSERT INTO inventory_dump (dump_timestamp, item_number, quantity, uploadTime)
VALUES (?,?,?,?), (?,?,?,?), (?,?,?,?)

Without the parentheses, it’ll try to put all the values you supply into the same row, which won’t work, especially since you only specify four columns to insert into.

Whoops, yes that was a typo.
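For completeness, the corrected %-style query building (parentheses included, Python 2.5-friendly) can be sanity-checked standalone with two hypothetical rows’ worth of placeholders:

```python
values = ["(?,?,?,?)", "(?,?,?,?)"]  # one parenthesized group per row
valueString = ", ".join(values)
query = "INSERT INTO inventory_dump (dump_timestamp, item_number, quantity, uploadTime) VALUES %s" % valueString
# query ends with: VALUES (?,?,?,?), (?,?,?,?)
```

The args list passed alongside it would then carry four entries per row, in the same order as the placeholder groups.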

Thank you very much for the help!