Which is more efficient or effective: converting to a PyDataSet once and then accessing it, or accessing the normal dataset directly?
It's more a matter of style. If you use the 'normal' dataset you'll have to loop through it by index, e.g.
for row in range(data.rowCount):
    for col in range(data.columnCount):
        print data.getValueAt(row, col)
When you convert it to a PyDataSet you can use the Python for ... in iterator, e.g.
pyData = system.dataset.toPyDataSet(data)
for row in pyData:
    for col in row:
        print col
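As a side note, PyDataSet rows can also be indexed by column name, which often reads better than positional access. A minimal sketch, assuming your dataset actually has a column with the placeholder name used here:

pyData = system.dataset.toPyDataSet(data)
for row in pyData:
    # 'ColumnName' is a placeholder; substitute one of your dataset's real column names
    print row['ColumnName']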
This is not quite true. There is a definite speed difference between the two, with access through a PyDataSet being 4 - 5x slower than access via dataset.getValueAt.
Tested against the test data from a power table
data = event.source.parent.getComponent('Power Table').data

d1 = system.date.now()
# Test 1: iterate the native dataset by index
for i in range(1000):
    for row in range(data.rowCount):
        for col in range(data.columnCount):
            t = data.getValueAt(row, col)
d2 = system.date.now()
# Test 2: convert to a PyDataSet (inside the loop) and iterate it directly
for i in range(1000):
    pyData = system.dataset.toPyDataSet(data)
    for row in pyData:
        for col in row:
            t = col
d3 = system.date.now()
print d2.time - d1.time, d3.time - d2.time
Huh. That's a much bigger penalty than I would have thought.
There is a speed difference, but when you can step through a 1,000-element dataset in only a few milliseconds, in most applications it's not going to make a difference. If you're crunching through huge amounts of data, though, it would be better to use the native dataset.
Interestingly, the native-dataset loop appears to be getting JIT-compiled by the JVM: it gets faster when you run it repeatedly, while the PyDataSet version stays at roughly the same duration.
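A rough way to watch that warm-up (an untested sketch, using the same 'data' binding as the test above) is to repeat the whole timed loop several times and print each duration:

data = event.source.parent.getComponent('Power Table').data

for run in range(5):
    d1 = system.date.now()
    for i in range(1000):
        for row in range(data.rowCount):
            for col in range(data.columnCount):
                t = data.getValueAt(row, col)
    d2 = system.date.now()
    # Later runs should report smaller numbers if the JIT is kicking in
    print "run", run, ":", d2.time - d1.time, "ms"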
That's an unfair test, because you're instantiating a new object (the actual PyDataset) inside the loop each time. I'd bet iteration speed is almost exactly the same without the object-creation penalty. That said, I guess I see the point: if you're doing this conversion frequently, you will have to pay that penalty. But it's the lion's share of the overhead; the actual iteration speeds are essentially identical.
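For reference, a 'fairer' version of the second half of the test (a sketch under the same setup, with the conversion hoisted out of the timed loop so only iteration is measured) would look something like:

data = event.source.parent.getComponent('Power Table').data
pyData = system.dataset.toPyDataSet(data)  # pay the conversion cost once, outside the timed loop

d1 = system.date.now()
for i in range(1000):
    for row in pyData:
        for col in row:
            t = col
d2 = system.date.now()
print d2.time - d1.time, "ms (iteration only)"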
Huh. I stand corrected; I broke the PyDataset creation out of the loop, and it's still orders of magnitude slower. I wonder if I can fix that.
EDIT: Yes. Even with the PyDataset initialization in the same block.
>>>
647 335
>>>
568 291
>>>
557 269
>>>
572 277
>>>
Yes, I was originally going to initialize the PyDataSet only outside the loop, but in a real-world test you would instantiate it each time. The reason for the x1000 iterations is to average out the times and to provide larger numbers that make comparison easier.
The test would be unfair if you only instantiated it once, imo. The cost of the PyDataSet object creation is the part that causes most of the slowdown.
Yeah, you're right. Either way, I just filed a PR that drops the time (whether the PyDataset is instantiated inside the loop or not) so much that it's faster than the native dataset iteration; I think the additional overhead is something about Jython wrapping the values coming out of the dataset.
Awesome! I went a bit further than this, to help quantify the numbers a bit.
I would also like to state that no matter which way you go, the performance will probably be acceptable, as this test case is for 1000 iterations. Typically, the operation will only be needed once. However, I did want to point out the performance difference.
Based on my test case, here is what I found.
Test #1 - Dataset
70th Percentile 1.0 ms
Variance 0.249012483006 ms
Standard Deviation 0.499011505885 ms
Mean 0.464444444444 ms
Test #2 - PyDataset
70th Percentile 3.0 ms
Variance 0.520503028056 ms
Standard Deviation 0.721458957984 ms
Mean 2.53444444444 ms
Here is the script
data = event.source.parent.getComponent('Power Table').data

results = {'t1': [], 't2': []}

# Test 1: native dataset access by index
for i in range(1000):
    d1 = system.date.now()
    for row in range(data.rowCount):
        for col in range(data.columnCount):
            t = data.getValueAt(row, col)
    d2 = system.date.now()
    results['t1'].append(d2.time - d1.time)

# Test 2: PyDataSet conversion plus iteration
for i in range(1000):
    d1 = system.date.now()
    pyData = system.dataset.toPyDataSet(data)
    for row in pyData:
        for col in row:
            t = col
    d2 = system.date.now()
    results['t2'].append(d2.time - d1.time)

# Trim the 50 fastest and 50 slowest samples to drop outliers
results['t1'] = sorted(results['t1'])[50:-50]
results['t2'] = sorted(results['t2'])[50:-50]

print "Test #1 - Dataset"
print "70th Percentile ", system.math.percentile(results['t1'], 70), " ms"
print "Variance ", system.math.variance(results['t1']), " ms"
print "Standard Deviation ", system.math.standardDeviation(results['t1']), " ms"
print "Mean ", system.math.mean(results['t1']), " ms"
print
print "Test #2 - PyDataset"
print "70th Percentile ", system.math.percentile(results['t2'], 70), " ms"
print "Variance ", system.math.variance(results['t2']), " ms"
print "Standard Deviation ", system.math.standardDeviation(results['t2']), " ms"
print "Mean ", system.math.mean(results['t2']), " ms"
Aaaactually…
It's the per-value column access (the inner "for col in row" loop) that causes the slowdown.
data = event.source.parent.getComponent('Power Table').data

results = {'t1': [], 't2': []}

# Test 1: native dataset access by index
for i in range(1000):
    d1 = system.date.now()
    for row in range(data.rowCount):
        for col in range(data.columnCount):
            t = data.getValueAt(row, col)
    d2 = system.date.now()
    results['t1'].append(d2.time - d1.time)

# Test 2: PyDataSet conversion, but iterating the column *names*
# instead of the row values, so no per-value access happens
for i in range(1000):
    d1 = system.date.now()
    pyData = system.dataset.toPyDataSet(data)
    pyCol = pyData.getColumnNames()
    for row in pyData:
        for col in pyCol:
            t = col
    d2 = system.date.now()
    results['t2'].append(d2.time - d1.time)

# Trim the 50 fastest and 50 slowest samples to drop outliers
results['t1'] = sorted(results['t1'])[50:-50]
results['t2'] = sorted(results['t2'])[50:-50]

print "Test #1 - Dataset"
print "70th Percentile ", system.math.percentile(results['t1'], 70), " ms"
print "Variance ", system.math.variance(results['t1']), " ms"
print "Standard Deviation ", system.math.standardDeviation(results['t1']), " ms"
print "Mean ", system.math.mean(results['t1']), " ms"
print
print "Test #2 - PyDataset"
print "70th Percentile ", system.math.percentile(results['t2'], 70), " ms"
print "Variance ", system.math.variance(results['t2']), " ms"
print "Standard Deviation ", system.math.standardDeviation(results['t2']), " ms"
print "Mean ", system.math.mean(results['t2']), " ms"
From looking at profiler data, the actual issue is Jython having to create new Python objects for each item in the Java dataset, then reflectively find the __getitem__ methods on those objects and check whether they make an iterable object. With the changes I made, PyDataset and PyRow are 'proper' Python objects and don't pay this reflection penalty; with or without @MMaynard's last set of changes the results are pretty much the same:
>>>
Test #1 - Dataset
70th Percentile 1.0 ms
Variance 0.225004325794 ms
Standard Deviation 0.474346208791 ms
Mean 0.658888888889 ms
Test #2 - PyDataset
70th Percentile 1.0 ms
Variance 0.229390681004 ms
Standard Deviation 0.478947472071 ms
Mean 0.355555555556 ms
>>>
Test #1 - Dataset
70th Percentile 1.0 ms
Variance 0.223570634038 ms
Standard Deviation 0.472832564485 ms
Mean 0.663333333333 ms
Test #2 - PyDataset
70th Percentile 1.0 ms
Variance 0.24956618465 ms
Standard Deviation 0.49956599629 ms
Mean 0.473333333333 ms
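If it helps to picture the reflection point above, here is a purely illustrative pure-Python wrapper (not the actual Jython/Ignition implementation) that declares __iter__ and __getitem__ up front, so the interpreter never has to discover iteration support reflectively on a Java object:

class SimpleRow(object):
    # Illustrative only: wraps one row of a Java Dataset with explicit sequence hooks
    def __init__(self, dataset, rowIndex):
        self._dataset = dataset
        self._rowIndex = rowIndex

    def __len__(self):
        return self._dataset.columnCount

    def __getitem__(self, col):
        return self._dataset.getValueAt(self._rowIndex, col)

    def __iter__(self):
        for col in range(self._dataset.columnCount):
            yield self._dataset.getValueAt(self._rowIndex, col)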
Either way, a fun diversion; those changes should land in an 8.0.x release pretty soon, though they'll probably miss the 8.0.12 cutoff.
Nice work! This is the fun stuff I like to dig into and troubleshoot anyway. Glad it led to a change that helps make the product better!
To close the loop: these changes just got merged in and will land in 8.0.13. While I was in there, I added index, count, and repeat to PyRows. Also, as a bonus, you can do stuff like tuple unpacking with PyRows, so something like the following will work:
for row in pydataset:
    id, name, value = row
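Hypothetical usage of the new PyRow helpers, assuming they follow standard Python sequence semantics (index and count search by value, and repeat corresponds to the * operator); the cell values here are made up:

for row in pydataset:
    print row.count(0)             # how many cells in this row equal 0 (made-up value)
    if 'Pump 1' in row:
        print row.index('Pump 1')  # position of the first matching cell
    print row * 2                  # assuming 'repeat' maps to sequence repetition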