PyDataset Performance Questions

Which is more efficient or effective: converting to a PyDataSet once and then accessing it, or using the normal dataset directly?

It’s more a matter of style. If you use the ‘normal’ dataset you’ll have to loop through it by index e.g.

for row in range(data.rowCount):
    for col in range(data.columnCount):
        print data.getValueAt(row, col)

When you convert it to a pyDataSet you can use the Python for … in iterator e.g.

pyData = system.dataset.toPyDataSet(data)
for row in pyData:
    for col in row:
        print col
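To make the contrast concrete, here's a plain-Python sketch of the two access patterns, using a list of lists as a stand-in for the dataset (Ignition's system.* scripting API isn't available outside a gateway, so the dataset itself is assumed):

```python
# Plain-Python sketch of the two access patterns; a list of lists stands in
# for the dataset, since Ignition's system.* API only exists in a gateway.
data = [[1, "a"], [2, "b"], [3, "c"]]

# Index-based access, analogous to dataset.getValueAt(row, col):
by_index = []
for row in range(len(data)):
    for col in range(len(data[row])):
        by_index.append(data[row][col])

# Iterator-based access, analogous to looping over a PyDataSet:
by_iter = []
for row in data:
    for col in row:
        by_iter.append(col)

# Both visit the same values in the same order.
assert by_index == by_iter == [1, "a", 2, "b", 3, "c"]
```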

This is not quite true. There is a definite speed difference between the two, with access through a PyDataSet being 4 - 5x slower than access via dataset.getValueAt.

Tested against the test data from a power table

data = event.source.parent.getComponent('Power Table').data

d1 = system.date.now()

for i in range(1000):
	for row in range(data.rowCount):
		for col in range(data.columnCount):
			t = data.getValueAt(row, col)
			
d2 = system.date.now()

for i in range(1000):
	pyData = system.dataset.toPyDataSet(data)
	for row in pyData:
		for col in row:
			t = col

d3 = system.date.now()

print d2.time-d1.time, d3.time-d2.time

Huh. That’s a much bigger penalty than I would have thought.

There is a speed difference, but when you can step through a 1,000 element dataset in only a few milliseconds, in most applications it’s not going to make a difference. If you’re crunching through huge amounts of data though, it would be better to use the native dataset.

Interestingly, it appears the native dataset path is being JIT-compiled by the JVM: it gets faster when you run it repeatedly, while the PyDataSet stays at roughly the same duration.

That’s an unfair test, because you’re instantiating a new object (the actual PyDataset) inside the loop each time. I’d bet iteration speed is almost exactly the same without the object-creation penalty. That said, I see the point: if you’re doing this conversion frequently, you will have to pay that penalty. But that penalty is the lion’s share of the overhead; the actual iteration speeds are essentially identical.
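The "where does the cost land" question can be sketched in plain Python. Wrapper below is a hypothetical stand-in for the toPyDataSet conversion (not the Ignition API): it copies the rows on construction, so building it inside the timed loop mixes that construction cost into the iteration numbers.

```python
import time

# Wrapper is a hypothetical stand-in for the toPyDataSet conversion:
# the row copy in __init__ is where the "conversion cost" is paid.
class Wrapper(object):
    def __init__(self, rows):
        self.rows = [list(r) for r in rows]   # conversion cost paid here
    def __iter__(self):
        return iter(self.rows)

data = [[r, r * 2, r * 3] for r in range(1000)]

def iterate(wrapped):
    n = 0
    for row in wrapped:
        for col in row:
            n += 1
    return n

t0 = time.time()
for _ in range(200):
    iterate(Wrapper(data))    # conversion paid on every pass
t1 = time.time()
w = Wrapper(data)             # conversion hoisted out of the loop
for _ in range(200):
    iterate(w)
t2 = time.time()

# The first duration includes 200 conversions; the second is iteration only.
print(t1 - t0, t2 - t1)
```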

Huh. I stand corrected: I broke the PyDataset instantiation out of the loop and iteration is still far slower. I wonder if I can fix that.

EDIT: Yes. Even with the PyDataset initialization in the same block.

>>> 
647 335
>>> 
568 291
>>> 
557 269
>>> 
572 277
>>> 

Yes, I was originally going to only initialize the PyDataSet outside the loop, but in a real world test, you would instantiate it each time. The reason for the x1000 iteration is to average out the times and also to provide a larger number to make comparison easier.

The test would be unfair if you only instantiated it once, imo. The cost of the PyDataSet object creation is the part that causes most of the slowdown.

Yeah, you’re right. Either way, I just filed a PR that drops the time (whether the pydataset is instantiated inside the loop or not) down so much it’s faster than the native dataset iteration; I think the additional overhead is something about Jython wrapping the values coming out of the dataset.


Awesome! I went a bit further to help quantify the numbers.

I would also like to state that no matter which way you go, the performance will probably be acceptable, as this test case is for 1000 iterations. Typically, the operation will only be needed once. However, I did want to point out the performance difference.

Based off my test case, here is what I found.

Test #1 - Dataset
70th Percentile  1.0 ms
Variance  0.249012483006 ms
Standard Deviation  0.499011505885 ms
Mean  0.464444444444 ms

Test #2 - PyDataset
70th Percentile  3.0 ms
Variance  0.520503028056 ms
Standard Deviation  0.721458957984 ms
Mean  2.53444444444 ms

Here is the script

data = event.source.parent.getComponent('Power Table').data

results = {'t1':[],'t2':[]}
for i in range(1000):
	d1 = system.date.now()
	for row in range(data.rowCount):
		for col in range(data.columnCount):
			t = data.getValueAt(row, col)
	d2 = system.date.now()
	results['t1'].append(d2.time-d1.time)		

for i in range(1000):
	d1 = system.date.now()
	pyData = system.dataset.toPyDataSet(data)
	for row in pyData:
		for col in row:
			t = col
	d2 = system.date.now()
	results['t2'].append(d2.time-d1.time)	

results['t1'] = sorted(results['t1'])[50:-50]
results['t2'] = sorted(results['t2'])[50:-50]

print "Test #1 - Dataset"
print "70th Percentile ", system.math.percentile(results['t1'],70)," ms"
print "Variance ",system.math.variance(results['t1'])," ms"
print "Standard Deviation ", system.math.standardDeviation(results['t1'])," ms"
print "Mean ", system.math.mean(results['t1'])," ms"
print
print "Test #2 - PyDataset"
print "70th Percentile ", system.math.percentile(results['t2'],70)," ms"
print "Variance ",system.math.variance(results['t2'])," ms"
print "Standard Deviation ", system.math.standardDeviation(results['t2'])," ms"
print "Mean ", system.math.mean(results['t2'])," ms"
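As an aside, the sorted(...)[50:-50] step in the script above is a trimmed sample: sort the 1,000 timings and drop the 50 fastest and 50 slowest before computing statistics, which keeps one-off outliers (GC pauses, etc.) from skewing the summary. A minimal plain-Python sketch of the same idea, with synthetic samples standing in for the millisecond timings:

```python
# Sketch of the trimming step: sort the samples and drop the 50 fastest and
# 50 slowest before computing summary statistics. The synthetic samples
# below stand in for the millisecond timings collected above.
samples = list(range(1000))            # 1,000 fake timing samples
trimmed = sorted(samples)[50:-50]      # keep the middle 900

mean = sum(trimmed) / float(len(trimmed))
variance = sum((x - mean) ** 2 for x in trimmed) / len(trimmed)

print(len(trimmed), mean, variance)
```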

Aaaactually…
It’s the column access through the row that causes the slowdown.


data = event.source.parent.getComponent('Power Table').data

results = {'t1':[],'t2':[]}
for i in range(1000):
	d1 = system.date.now()
	for row in range(data.rowCount):
		for col in range(data.columnCount):
			t = data.getValueAt(row, col)
	d2 = system.date.now()
	results['t1'].append(d2.time-d1.time)		

for i in range(1000):
	d1 = system.date.now()
	pyData = system.dataset.toPyDataSet(data)
	pyCol = pyData.getColumnNames()
	for row in pyData:
		for col in pyCol:
			t = col
	d2 = system.date.now()
	results['t2'].append(d2.time-d1.time)	

results['t1'] = sorted(results['t1'])[50:-50]
results['t2'] = sorted(results['t2'])[50:-50]

print "Test #1 - Dataset"
print "70th Percentile ", system.math.percentile(results['t1'],70)," ms"
print "Variance ",system.math.variance(results['t1'])," ms"
print "Standard Deviation ", system.math.standardDeviation(results['t1'])," ms"
print "Mean ", system.math.mean(results['t1'])," ms"
print
print "Test #2 - PyDataset"
print "70th Percentile ", system.math.percentile(results['t2'],70)," ms"
print "Variance ",system.math.variance(results['t2'])," ms"
print "Standard Deviation ", system.math.standardDeviation(results['t2'])," ms"
print "Mean ", system.math.mean(results['t2'])," ms"

From looking at profiler data, the actual issue is Jython having to create new Python objects for each item in the Java dataset, then reflectively find the __getitem__ methods on those objects and check whether they make the object iterable. With the changes I made, PyDataset and PyRow are “proper” Python objects and don’t pay this reflection penalty; with or without @MMaynard’s last set of changes, the results are pretty much the same:
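For anyone curious, the two iteration protocols can be demonstrated in plain Python (these are illustrative classes, not the actual PyRow implementation). A class that only defines __getitem__ is still iterable through the legacy sequence protocol, which the interpreter has to discover at runtime, while a class that defines __iter__ advertises iteration directly:

```python
# LegacyRow only defines __getitem__, so Python iterates it through the
# legacy sequence protocol: call __getitem__ with 0, 1, 2, ... until
# IndexError. ProperRow defines __iter__ and hands back a real iterator.
class LegacyRow(object):
    def __init__(self, values):
        self._values = values
    def __getitem__(self, i):
        return self._values[i]        # IndexError eventually ends iteration

class ProperRow(object):
    def __init__(self, values):
        self._values = values
    def __iter__(self):
        return iter(self._values)

# Both are iterable, but the legacy path has to be discovered at runtime.
assert list(LegacyRow([1, 2, 3])) == [1, 2, 3]
assert list(ProperRow([1, 2, 3])) == [1, 2, 3]
```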

>>>
Test #1 - Dataset
70th Percentile  1.0  ms
Variance  0.225004325794  ms
Standard Deviation  0.474346208791  ms
Mean  0.658888888889  ms

Test #2 - PyDataset
70th Percentile  1.0  ms
Variance  0.229390681004  ms
Standard Deviation  0.478947472071  ms
Mean  0.355555555556  ms
>>> 
Test #1 - Dataset
70th Percentile  1.0  ms
Variance  0.223570634038  ms
Standard Deviation  0.472832564485  ms
Mean  0.663333333333  ms

Test #2 - PyDataset
70th Percentile  1.0  ms
Variance  0.24956618465  ms
Standard Deviation  0.49956599629  ms
Mean  0.473333333333  ms

Either way, a fun diversion; those changes should land in an 8.0.x release pretty soon, though they’ll probably miss the 8.0.12 cutoff.


Nice work! This is the fun stuff I like to dig into and troubleshoot anyway. Glad it led to an action that helps make the product better!


To close the loop - these changes just got merged in and will land in 8.0.13. While I was in there, I added index, count, and repeat to PyRows. Also, as a bonus, you can do stuff like tuple unpacking with PyRows - so something like the following will work:

for row in pydataset:
    id, name, value = row
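Unpacking works for any object whose rows are iterable, because unpacking just consumes the row's iterator. A minimal plain-Python sketch, with RowStub as a hypothetical stand-in for PyRow (index and count here simply delegate to an underlying tuple):

```python
# RowStub is a hypothetical stand-in for PyRow: __iter__ enables tuple
# unpacking, and index/count delegate to the underlying tuple.
class RowStub(object):
    def __init__(self, values):
        self._values = tuple(values)
    def __iter__(self):
        return iter(self._values)
    def index(self, v):
        return self._values.index(v)
    def count(self, v):
        return self._values.count(v)

row = RowStub([7, "pump", 42.5])
id, name, value = row                  # unpacking consumes __iter__
assert (id, name, value) == (7, "pump", 42.5)
assert row.index("pump") == 1 and row.count(7) == 1
```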