Comparing dataSets by unique id in one column

waker1der · July 19, 2018, 1:47pm

HI,

part of my application I make a SELECT query from a datawarehouse table that i have read only access. I display the data in a power table and this query runs every 2 hours.

I then want to take this data and write to my own tables so I can trend and analyse.

I can INSERT to my table using pa.db.insertRows("tableName",mpds) but the down side to this is if my data is not unique I am duplicating rows.

So my plan is to shift my SELECT query result which I store in a memory tag of type dataset in to a seperate memory tag and then do a comparison of the two datasets and remove duplicate rows. I tried to use mutablePyDataSet.removeDuplicateRows() but some cells would change between the two queries leaving me with duplicate rows.

So I want to use the use one column that holds unique data like a PRIMARY KEY to identify duplicates, by building a list of the case_ids and then comparing the rows in the dataSet to see if that case_id exists but i am struggling to get it wokring

#grab datasets
lastSelect 			= system.tag.getTagValue("FC Metrics/selectFromBooker")
previousSelect 		= system.tag.getTagValue("FC Metrics/oldFromBooker")

#convert to mutablePyDatasets
mpdsLastSelect 		= pa.dataset.toData(lastSelect) #latest select query
mpdsPreviousSelect 	= pa.dataset.toData(previousSelect) #previous select query

#build list of case ids from previuos select
listCI = []
for row in mpdsPreviousSelect:
	row = row["case_id"]
	listCI.append(row)

#compare the list of case id's to the last select and remove duplicate rows
for row in mpdsLastSelect:
	for case in listCI:
		if case in mpdsLastSelect:
			mpdsLastSelect.removeRow(row)

waker1der · July 19, 2018, 1:48pm

@nmudge you maybe able to help with this

this is the exception

  File "__pyclasspath__/perfectabstractions/common/powerscripting/dataset$py.class", line 335, in __contains__
UnboundLocalError: local variable 'newRow' referenced before assignment

phillip.bow · July 19, 2018, 6:16pm

Could you select from your data warehouse into your own copy of the table that you can query more often for your analysis(and possibly even bind your power table to)?

Select XX from dwh.BookerTable
Loop through rows, upserting in to your local table
– upsert syntax is db vendor specific in a lot of cases, but what it boils down to is you perform an insert statement if your key value doesn’t already exist, otherwise you update the existing row for your key value

JordanCClark · July 20, 2018, 2:35pm

Redefining row in the middle of a loop is, methinks, asking for trouble. No harsh experience or anything, because I always have flawless code. {cough-cough}

The columns method will get you what you want for your filter list, and the filterRows method should streamline the rest.

This, I think, should work for you. I'm going on the assumption that the two datasets have the same structure. EDIT: Not sure why my formatting looks funky

#grab datasets
lastSelect 			= system.tag.getTagValue("FC Metrics/selectFromBooker")
previousSelect 		= system.tag.getTagValue("FC Metrics/oldFromBooker")

#convert to mutablePyDatasets
mpdsLastSelect 		= pa.dataset.toData(lastSelect) #latest select query
mpdsPreviousSelect 	= pa.dataset.toData(previousSelect) #previous select query

#build list of case ids from previuos select
listCI = mpdsPreviousSelect.columns["case_id"]

#compare the list of 'case id' to the last select and remove duplicate rows
mpdsLastSelect.filterRows(lambda row, listCI=listCI: row['case_id'] not in listCI)

Here is some code you can play with in the Script Console:

headers = ['ndx', 'col1', 'col2']
rows1 = [
		[1,2,3],
		[2,3,4]
		]
		
rows2 = [
		[1,2,3],
		[2,3,4],
		[3,4,5],
		[4,5,6],
		[5,6,7]
		]
		
data1 = pa.dataset.toData(headers, rows1)
data2 = pa.dataset.toData(headers, rows2)

print data1
print data2

filterList = data1.columns['ndx']
print filterList


data2.filterRows(lambda row, filterList=filterList: row['ndx'] not in filterList)

print data2