Help combining two datasets based on Condition

dkhayes117 · February 17, 2020, 4:46pm

The first dataset, ds1, has several columns, the first column is ‘user’. I have a second dataset, ds2, that has two columns [‘premp#’,‘prname’]. I want to add a column to ds1 where it inserts the correct employee number, premp#, by the correct user (user = prname). I’m having trouble figuring out how to do the comparison between user and prname

zacslade · February 17, 2020, 5:15pm

I think something like this would do it:

newColumnName = "premp#"
newColumnData = []
for row in range(ds1.getRowCount()):
	for ds2index in range(ds2.getRowCount()):
		if ds1.getValueAt(row, "user") == ds2.getValueAt(ds2index, "prname"):
			newColumnData.append(ds2.getValueAt(ds2index, "premp#"))
			break

ds1 = system.dataset.addColumn(ds1, 0, newColumnData, newColumnName)

But if it’s a large table, you’ll probably get more efficiency by sorting both columns first, and then looping through ds1 but manually incrementing your index in ds2 whenever the value at “user” changes in ds1, and then incrementing again if they don’t match.
… like this:

newColumnName = "premp#"
newColumnData = []
ds1 = system.dataset.sort(ds1, "user")
ds2 = system.dataset.sort(ds2, "prname")
ds2index = 0

for row in range(ds1.getRowCount()):
	while (ds1.getValueAt(row, "user") != ds2.getValueAt(ds2index, "prname")):
		ds2index += 1
	if ds1.getValueAt(row, "user") == ds2.getValueAt(ds2index, "prname"):
		newColumnData.append(ds2.getValueAt(ds2index, "premp#"))

ds1 = system.dataset.addColumn(ds1, 0, newColumnData, newColumnName)

Of course, both of these depend on the idea that every user has a prname. If not, you’ll be stuck using the first one, and you’ll have to append empty rows sometimes, like this:

newColumnName = "premp#"
newColumnData = []
success = 0
for row in range(ds1.getRowCount()):
	for ds2index in range(ds2.getRowCount()):
		if ds1.getValueAt(row, "user") == ds2.getValueAt(ds2index, "prname"):
			newColumnData.append(ds2.getValueAt(ds2index, "premp#"))
			success = 1
			break
	if success == 0:
		newColumnData.append(0)

ds1 = system.dataset.addColumn(ds1, 0, newColumnData, newColumnName)

bpreston · February 17, 2020, 5:15pm

Would something like this work? The top part is just creating a dataset to test it with.

header = ['name','number']
ds1 = [['bp',123],['pb',321]]
ds1 = system.dataset.toDataSet(header,ds1)


header = ['premp#','prname']
ds2 = [[123,'bp'],[321,'pb']]
ds2 = system.dataset.toDataSet(header,ds2)
print ds1
print ds2

ds1 = system.dataset.toPyDataSet(ds1)
ds2 = system.dataset.toPyDataSet(ds2)
print ds1
print ds2

fd = []

for name, num in ds1:
	for premp, prname in ds2:
		if prname == name:
			fd.append([name,num,premp])
			break

header = ['name','number','premp']		
fd = system.dataset.toDataSet(header,fd)

Right after I hit post I saw zacslade beat me to it.

witman · February 17, 2020, 5:48pm

Here’s an approach similar to @zacslade’s third version for handling possible no match rows, only it converts the datasets to PyDataSets for more readable code. I’m not sure what difference it may make to efficiency.

# Create sample datasets.
ds1 = system.dataset.toDataSet(['user','other'],[['juser', 'emp3'],['kuser', 'emp2'],['muser','noMatch']])
ds2 = system.dataset.toDataSet(['premp#','prname'],[[2, 'kuser'],[1, 'luser'],[3, 'juser']])

# Create new column for ds1 with employee numbers.
newColumn = []
for record in system.dataset.toPyDataSet(ds1):
	# Default empNumber to None if no match found.
	empNumber = None
	for row in system.dataset.toPyDataSet(ds2):
		if record['user'] == row['prname']:
			empNumber = row['premp#']
			break
	newColumn.append(empNumber)

# Create new dataset from ds1 with new column inserted second from left.
ds3 = system.dataset.addColumn(ds1, 1, newColumn, 'premp#', int)

# Print results.
print "ds1"
for row in system.dataset.toPyDataSet(ds1):
	print [col for col in row]
print "ds2"
for row in system.dataset.toPyDataSet(ds2):
	print [col for col in row]
print "ds3"
for row in system.dataset.toPyDataSet(ds3):
	print [col for col in row]

Output:

>>> 
ds1
[u'juser', u'emp3']
[u'kuser', u'emp2']
[u'muser', u'noMatch']
ds2
[2, u'kuser']
[1, u'luser']
[3, u'juser']
ds3
[u'juser', 3, u'emp3']
[u'kuser', 2, u'emp2']
[u'muser', None, u'noMatch']
>>>

dkhayes117 · February 17, 2020, 6:28pm

I had to make a few small changes to deal with types and whitespace from the datasets (came from query) to make @witman’s script work. Appreciate the help everyone! This is the best forum I’ve ever used