Is there a union function in 8.1?

I am trying to do something like this,
combined_dataset = system.dataset.union(dataset1, dataset2, ['common_column'])
but the error message suggests that the system.dataset object does not have a union attribute.
It was my understanding that the union function was introduced in 8.0. Is that the case?

No, not directly. The closest thing would be system.dataset.addRows(), with the rows argument converted to a PyDataset.
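
For reference, a rough sketch of that row-stacking approach, assuming both datasets share the same column structure (the function name and the conversion details are just illustrative):

# A minimal sketch of a SQL-style UNION (stacking rows), assuming identical columns
def stackDatasets(datasetOne, datasetTwo):
	# Convert the second dataset to a PyDataSet so its rows can be iterated
	pyRows = system.dataset.toPyDataSet(datasetTwo)
	# Build a plain list of row-value lists and append them to the first dataset
	newRows = [[row[col] for col in range(datasetTwo.columnCount)] for row in pyRows]
	return system.dataset.addRows(datasetOne, newRows)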

I'm not sure what you mean by the common_column bit; perhaps you are looking for a join?

I have two datasets, each with two columns. Each dataset has a column called t_stamp at column[0]. This is what I meant by 'common_column'.

I want to create a new n x 3 dataset, effectively dropping one of the t_stamp columns in the merge.

In SQL terms, this operation is not a union; a union stacks rows of data from separate queries on top of each other. What you want to do is combine data on a primary key.

If the number of rows in each dataset is guaranteed to be the same, with the same primary keys in the same order, then this could be scripted using system.dataset.addColumn.
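
For example, something along these lines, assuming each dataset's value column sits at index 1 (a minimal sketch, not a tested implementation):

# A minimal sketch, assuming equal row counts and perfectly aligned t_stamp keys
valueColumn = datasetTwo.getColumnAsList(1)        # dataset two's value column
combined = system.dataset.addColumn(
	datasetOne,                                    # dataset one keeps its t_stamp and value columns
	datasetOne.columnCount,                        # append the new column at the end
	valueColumn,                                   # values for the new column
	datasetTwo.getColumnName(1),                   # reuse dataset two's column name
	datasetTwo.getColumnType(1))                   # and its column type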

However, if there is no guarantee that the number of rows will be the same or that the primary keys will be perfectly aligned, it would probably be best to develop a generic library script that can handle this task.

Example:

# Combines two datasets on a given primary key column
def combineOnColumn(datasetOne, datasetTwo, columnName):
	# Get the headers from the first and second datasets
	headersOne = system.dataset.getColumnHeaders(datasetOne)
	headersTwo = system.dataset.getColumnHeaders(datasetTwo)
	
	# Identify which primary key column to keep in the headers, and which one to drop
	# ...if the key column is missing from either dataset, return None because the datasets can't be combined 
	if columnName in headersOne and columnName in headersTwo:
		droppedKeyColumn = headersOne.index(columnName)
		keptKeyColumn = headersTwo.index(columnName)
		headersOne.remove(columnName)
	else:
		return
	
	# Get the primary keys from both datasets
	datasetOneKeys = datasetOne.getColumnAsList(droppedKeyColumn)
	datasetTwoKeys = datasetTwo.getColumnAsList(keptKeyColumn)
	
	# Create the headers and data parameters for the eventual combined dataset
	combinedHeaders = headersOne + headersTwo
	combinedData = []
	
	# Get each row of the first dataset
	for key in xrange(datasetOne.rowCount):
		rowData = [datasetOne.getValueAt(key, column) for column in xrange(datasetOne.columnCount) if column != droppedKeyColumn]
		
		# Check to see if the primary key value for the current row of the first dataset is in the second dataset
		# ...if so, add the row values from the corresponding row in the second dataset to the first dataset's row
		if datasetOneKeys[key] in datasetTwoKeys:
			rowData += [datasetTwo.getValueAt(datasetTwoKeys.index(datasetOneKeys[key]), column) for column in xrange(datasetTwo.columnCount)]
		
		# ...if not, fill the remaining combined row's columns with null values
		else:
			rowData += [datasetOneKeys[key] if column == keptKeyColumn else None for column in xrange(datasetTwo.columnCount)]
		
		# Add the completed row to the combined data
		combinedData.append(rowData)
	
	# Go through the primary keys from the second dataset to see if any were missing from the first dataset
	# ...if so, add them to the combined data filling in null values for the first dataset's columns in the combined dataset
	for key in xrange(datasetTwo.rowCount):
		if datasetTwoKeys[key] not in datasetOneKeys:
			rowData = [None for column in xrange(datasetOne.columnCount) if column != droppedKeyColumn]
			rowData += [datasetTwo.getValueAt(key, column) for column in xrange(datasetTwo.columnCount)]
			combinedData.append(rowData)
	
	# Create the combined dataset and return it
	return system.dataset.toDataSet(combinedHeaders, combinedData)

If the above function's library script is called datasetScripts, then the function would be called and used like this:

# Get the first dataset
datasetOne = event.source.parent.getComponent('Power Table').data

# Get the second dataset
datasetTwo = event.source.parent.getComponent('Power Table 1').data

# Combine the datasets on a primary key column using a custom function in the script library
combinedDataset = datasetScripts.combineOnColumn(datasetOne, datasetTwo, 't_stamp')

# Use the combined dataset in some way
event.source.parent.getComponent('Power Table 2').data = combinedDataset

Result:

Edit: Changed try/except logic to if/else logic for the key column verification.

Edit 2: Noticed the time stamp was omitted from the combined set when the key value from dataset one was not in dataset two. Added the missing condition to include the time stamp.


I think maybe you made this a bit harder than it needed to be?

Of course I could also be wrong.

I think, perhaps, this would result in the same thing:

def combineDatasetsByColumn(dataset1, dataset2, column):
    # Keep every column from the second dataset except the shared key column
    columnsToKeep = [name for name in dataset2.columnNames if name != column]
    filteredDataset = system.dataset.filterColumns(dataset2, columnsToKeep)

    # Append each remaining column from the second dataset onto the first
    for index, columnName in enumerate(filteredDataset.columnNames):
        dataset1 = system.dataset.addColumn(dataset1,
                                            dataset1.columnCount,
                                            filteredDataset.getColumnAsList(index),
                                            columnName,
                                            filteredDataset.getColumnType(index))

    return dataset1
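
If this function also lives in the datasetScripts library (an assumption for illustration), it would be called the same way as the earlier example:

combinedDataset = datasetScripts.combineDatasetsByColumn(datasetOne, datasetTwo, 't_stamp')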

This assumes that both datasets have the same number of rows, and that the key column in the first dataset is identical to the key column in the second dataset.

My function allows for more complicated scenarios such as mismatched key columns and datasets with different row counts.


Thanks Irose, I will look at your suggestion and see if it works for my needs.

Thanks justinedwards.jle, I will look at your suggestion and see if it works for my needs.