Reordering a query resultset by list of ids

I've run into an issue that's crashing my gateway...
I have a list of part ids stored in an array in a custom component prop. The ids can be in any order, and are not sequential/incremental.

I need to run a query on a database table and order it in the order of the ids in the custom prop list.
The most basic example of the query is:

SELECT id, others
FROM table
WHERE id IN ({custom.partIds})

UNION
SELECT -3600, others
UNION
SELECT -25300, others

1st question: is it possible to order by the part Ids in the query itself?

Note: the UNIONS are dynamically added into the SQL query. These ids are "special" cases and don't actually exist as ids in the SQL table, but still need to be returned in the list. I could have also added these onto the resultset, but chose to do it in the query itself.
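
For illustration, the dynamic building is roughly along these lines (a simplified sketch, not my actual query code; the real thing pulls the ids from the custom prop and uses proper query parameters):

# Illustrative sketch only: append the special-case rows as extra UNIONs
part_ids = [1005, 1001, 1010]        # would come from {custom.partIds}
special_ids = [-3600, -25300]        # "special" ids that don't exist in the table

query = 'SELECT id, others FROM table WHERE id IN ({})'.format(
	','.join(str(pid) for pid in part_ids))
for sid in special_ids:
	query += '\nUNION\nSELECT {}, others'.format(sid)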

The resultset will come in the wrong id order, so I need to order it according to the part ids custom array, which currently I'm doing in a wildly inefficient python function which loops through all part ids and resultset rows, and essentially recreates the resultset :confused:
If I have >~350 part Ids, the gateway crashes while running the sorting function... If I remove the sorting, it is ok.

In case you really need to see my abomination of a function, I've posted it below. But are there any more efficient ways to reorder this?

from java.lang.System import nanoTime  # java.lang.System.nanoTime, used for the timing below

LOGGER = system.util.getLogger('Project-Scripting-shared.util.dataset')
def sortDatasetByFieldValues(dataset, fieldNames, fieldValues, DEBUG=False):
	"""
	Sorts a DataSet or PyDataSet by the fieldValues passed in for each of the fieldNames in the dataset.
	
	Notes:
		- if a field value in the list doesn't exist in the dataset, an error is returned
		- rows will be filtered out of the dataset if the value doesn't exist in the fieldValues
		- if the fieldValues contain duplicate values, duplicates will also appear in the returned dataset
	
	Args:
		fieldNames [list]				- a list of field names to order by, in order of precedence
										  E.g. ['partId', 'part_length']
		fieldValues [list of lists]		- a list of lists containing the values of the field names in each list of list
										  E.g. [[1005,1004,1,1001,1], [None, None, 3700, None, 1500]]
	"""
	fna = 'sortDatasetByFieldValues(dataset={}, fieldNames={}, fieldValues={})'.format(dataset, fieldNames, fieldValues)
	
	t = [nanoTime()]
	data_sorted = []
	valueFound = False
	headers = list(dataset.columnNames)
	
	for fieldValueIndex in range(len(fieldValues[0])):
		valueFound = False
		if DEBUG: print 'looking for ', [None if fieldValues[i][fieldValueIndex] is None else fieldValues[i][fieldValueIndex] for i in range(len(fieldValues))]
		
		for row in range(dataset.rowCount):
			if DEBUG: print '  row:', row, 'value:', [dataset.getValueAt(row, fieldName) for fieldName in fieldNames]
			if all(fieldValues[i][fieldValueIndex] is None or str(dataset.getValueAt(row, fieldNames[i])) == str(fieldValues[i][fieldValueIndex]) for i in range(len(fieldNames))):
				if DEBUG: print '  found fieldValue'
				data_sorted.append([dataset.getValueAt(row, header) for header in headers])
				# remove the row from the dataset to reduce search time!!
				#dataset = system.dataset.deleteRow(dataset, row) # this isn't working... rows cannot be found if this is in
				valueFound = True
				break
		
		if not valueFound:
			if DEBUG: print '  NOT FOUND!'
			msg = '{}\r\nField `{}` value `{}` not found in SQL table.'.format(fna, fieldNames, ','.join(map(str, [None if fieldValues[i][fieldValueIndex] is None else fieldValues[i][fieldValueIndex] for i in range(len(fieldValues))])))
			
			LOGGER.error(msg)
			raise ValueError(msg)
	
	t.append(nanoTime())
	LOGGER.trace('{}: Finished sorting. Time: {} us'.format(fna, (t[1]-t[0])/1000))
	
	ds_sorted = system.dataset.toDataSet(headers, data_sorted)
	
	if 'PyDataSet' in str(type(dataset)):
		ds_sorted = system.dataset.toPyDataSet(ds_sorted)
	
	if DEBUG:
		print 'sorted dataset:'
		for sorted_row in data_sorted:
			print sorted_row
	return ds_sorted

From the SQL query point of view, could you do something like this?

SELECT *
	FROM (SELECT id
		FROM devices
		WHERE id in (113,112,114)
		UNION
		SELECT 101) AS tbl
ORDER BY id

That will order by asc/desc ids though, not by a defined order, e.g. Ids: 1005, 1001, -3700, 1010, -25000, etc.

ORDER BY Id will order the above as: -25000, -3700, 1001, 1005, 1010

Right, I get you. Sorry, I didn't see the post edit with the function and misread the OP :sweat_smile:

try this, see if you get the same issue:

(edit: doesn't work, needs fixing)
def sortds(ds, fields, order_lists):
	headers = list(ds.getColumnNames())
	data = list(system.dataset.toPyDataSet(ds))
	indices = [headers.index(f) for f in fields]
	data.sort(key=lambda(row): (order.index(i) for order, i in zip(order_lists, indices)))
	return system.dataset.toDataSet(headers, data)

I'll give it a go tonight in a couple of hours. I actually went looking to use sort but wasn't sure it would work; looks like it'd be faster than mine though, hopefully :crossed_fingers:

No reason for sort not to work, but it does require casting the dataset to a list... which also means transforming the dataset to a pydataset.

Might or might not be faster. You'll have to try it to know!
I'd do it, but I can't seem to make your function work, and frankly it's too complex to debug quickly :X

Hahaha, yes... I wouldn't call it particularly nice. My plan was to add comments and simplify it once I proved it. But definitely welcome using sort!

edit: Okay so it doesn't actually work. I broke something when changing it to accept multiple sort parameters, I'll try and fix it when I can.

In case you need to adjust a few things, here's how it works (it's not super complicated, but it might not be super intuitive at first):

  • get the indices of the columns corresponding to the fields we want to sort by
  • .sort can sort by multiple 'fields', i.e. .sort(key=lambda(row): (row[1], row[0]))
  • since we have a dynamic number of fields to sort by, we need to construct the tuple dynamically as well, using a comprehension
  • we reference the 'order_lists' in sequence, find the index where the current row's value is, and sort by that

It's a bit packed, but the logical flow is not too complex.
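
For example, for a single row the key tuple gets built like this (toy values, just to show the mechanics):

# toy illustration of how the key tuple is built for one row
row = ['bar', 42, 'b']
indices = [1, 2]                              # column positions of the sort fields
order_lists = [[42, 84, 21], ['c', 'b', 'a']]

key = tuple(order.index(row[i]) for order, i in zip(order_lists, indices))
print key   # (0, 1): 42 is first in its order list, 'b' is second in its list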

edit: wait, this needs some fixing. I'm not sure why but the sorting doesn't always work.


Well it's looking promising in terms of speed! 135x and 218x faster, just a bit better :crazy_face:

I haven't checked output yet

Well... I think I got it. 2 issues I introduced when I went from simple sort parameter to dynamic sort parameters:

  • I boinked the order.index() parameter, which should be order.index(row[i]) instead of order.index(i)
  • The whole key wasn't a tuple, but a generator.
    By random luck, the output actually matched the sort parameters when I first tested it.

fixed version, should produce correct output:


def sortds(ds, fields, order_lists):
	headers = list(ds.getColumnNames())
	data = list(system.dataset.toPyDataSet(ds))
	indices = [headers.index(f) for f in fields]
	data.sort(key=lambda(row): tuple(order.index(row[i]) for order, i in zip(order_lists, indices)))
	return system.dataset.toDataSet(headers, data)

note: you can replace tuple(x) with a simple list comprehension ([x]). It should sort the same.
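
That is, lists of the same values compare element by element just like tuples do, so the sort order comes out identical:

# a list key sorts the same way a tuple key does
print (0, 1) < (0, 2)   # True
print [0, 1] < [0, 2]   # True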

Oh, wait... I didn't account for when the value in the dataset is not found in the order list...
I'll check that later, but what should happen in this case ?

What about this one, Nick? I'm assuming you want the ids ordered before you add the unions, if I'm reading this correctly.

SELECT * 
FROM (SELECT id, others
      FROM table
      WHERE id IN ({custom.partIds})
      ORDER BY id)

UNION
SELECT -3600, others
UNION
SELECT -25300, others

Hmm, I'm getting an error running that updated function.

 File "<module:shared.util.dataset>", line 81, in sortds
  File "<module:shared.util.dataset>", line 81, in <lambda>
  File "<module:shared.util.dataset>", line 81, in <genexpr>
ValueError: list.index(x): x not in list

Which is here:

data.sort(key=lambda(row): tuple(order.index(row[i]) for order, i in zip(order_lists, indices)))

I may have forgotten to mention a small (big) detail. I actually need to sort on both the part_id and the part_length values.

I'll make it easier to imagine. Below generates an example cutdown dataset that I get back from the query:

ds = system.dataset.builder(part_id=int, part_length=int).build()
ds = system.dataset.addRows(ds, [
	[1000, 2500],
	[1001, 2200],
	[1002, 254],
	[1003, 1250],
	[1, 3700],
	[0, 2200],
	[1, 2250]])
pds = system.dataset.toPyDataSet(ds)
### sortds(ds, ['part_id', 'part_length'], [
###	[1, 1, 0, 1000, 1001, 1003, 1002], [3700, 2250, 2200, None, None, None, None]
###	])
# it might make more sense to zip the last arg:
sortds(ds, ['part_id', 'part_length'], zip(
	[1, 1, 0, 1000, 1001, 1003, 1002], [3700, 2250, 2200, None, None, None, None]
	))

You can see that the part_id values are ascending up until the 0/1 ids, which are the special cases and can appear in any order at the bottom.

The order that I want them to be rearranged in, using the notation [part_id, part_length], is:

[1, 3700]
[1, 2250]
[0, 2200]
[1000, <don't care*>]
[1001, <don't care*>]
[1003, <don't care*>]
[1002, <don't care*>]

*The reason I don't care is that I don't know the part_lengths before running the query. The parts table holds this, and the part_length never changes for a given part_id. However, the 0/1 ids are parts with variable lengths determined by the operator, which is why they're inserted at the bottom in the SQL query.

Nope, I do want to order all of the ids, including those in the unions. But an ORDER BY will still only order ids ascending or descending, not in a specified order that I request (for example: 1000, 1004, 1002, 1005, 1003).

Yes, that's the 'when a value from the query result is not in the order list' part that I omitted.
But since I don't know what should happen in that case, I can't really implement it.

Well, yea, that's what passing a tuple to sort's key does.
Okay, not a tuple, a function that returns a tuple.

Every item in that tuple will be used in the sort.

this:

data = [
	['foo', 21, 'a'],
	['bar', 42, 'b'],
	['baz', 42, 'c'],
	['pox', 84, 'a'],
]
headers = ['name', 'num', 'char']
ds = system.dataset.toDataSet(headers, data)

order_fields = [
	'num',
	'char'
]
order_lists = [
	[42, 84, 21],
	['c', 'b', 'a']
]

ds = sortds(ds, order_fields, order_lists)

will produce this:

[u'baz', 42, u'c']
[u'bar', 42, u'b']
[u'pox', 84, u'a']
[u'foo', 21, u'a']

edit:
wait, why are you zipping your order lists?

You can leave it unzipped if it makes more sense.

OK, so my None values aren't in the order list, but these should be ignored, and any of the part_ids that match can be used.

E.g. if I had:
part_id, part_length
1004, 1250
1001, 450
1001, 450

every part_id == 1001 will have a length of 450, so you can match any of them

Well it expects as many order lists as order fields.
If you zip your 2 lists together, it will produce a list of n tuples where n is the length of the shortest list.
The way I coded that function, zipping your lists is just not gonna work.
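
To make it concrete:

# zipping pairs the two lists up element-by-element, which is not the
# shape the function expects (it wants one flat list per sort field):
print zip([1, 1, 0, 1000], [3700, 2250, 2200, None])
# [(1, 3700), (1, 2250), (0, 2200), (1000, None)]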

Got it, unzipped it is. But see my edit above. Meanwhile, I'm working through the lambda function, figuring out how to modify it to suit.

I have situations where I need to ORDER BY alphabetically but need a few special rows on top; after those rows it's just a normal ORDER BY. Here's an example using MySQL:

ORDER BY
	(CASE WHEN idx = -1 THEN 0 ELSE 1 END) ASC,
	(CASE WHEN idx = 2882 THEN 0 ELSE 1 END) ASC,
	(CASE WHEN idx = 2442 THEN 0 ELSE 1 END) ASC,
	(CASE WHEN idx = 2559 THEN 0 ELSE 1 END) ASC,
	(CASE WHEN idx = 2879 THEN 0 ELSE 1 END) ASC,
	name

If you have a custom order for every single row and there is never a simple ASC/DESC this won't help, but if you only need to get a few rows on top and then do a normal order by this works for me.

Combine with this SELECT * FROM (SELECT blah FROM table UNION blah UNION blah2) ORDER BY (CASE etc)

You can define a function instead of using a lambda, if it makes things easier.

If you need to handle cases where not all values are in the order lists, a lambda might be too limited.
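
Something like this could be a starting point; a rough sketch, untested against your data, and treating a row value that isn't in its order list (e.g. your None 'don't care' slots) as 'sort after everything that is listed' is my assumption of what you want:

def sortds(ds, fields, order_lists):
	headers = list(ds.getColumnNames())
	data = list(system.dataset.toPyDataSet(ds))
	indices = [headers.index(f) for f in fields]
	
	def sort_key(row):
		key = []
		for order, i in zip(order_lists, indices):
			if row[i] in order:
				key.append(order.index(row[i]))
			else:
				# value not found in its order list (e.g. a length you don't
				# care about): rank it after all explicitly listed values
				key.append(len(order))
		return tuple(key)
	
	data.sort(key=sort_key)
	return system.dataset.toDataSet(headers, data)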