Script help: Holiday edge case leads to empty datasets

zacharyw.larson · November 28, 2022, 3:07pm

for row, nextRow in zip(datasetIn, datasetIn[1:]):

After a long holiday, I don't have any data.
So when I reference nextRow, I get an error that I tried to call an undefined nextRow.

So to fix this, I thought maybe:

try:
    hoursSum = nextRow['Hours']
    hoursSum = 0
except:
    return value

However, I am still getting an error farther down in the script.

I figured it out sort of.
Looks like I needed the try/except outside of the for loop, because I don't think the for loop body runs when there are no rows.

for row, nextRow in zip(datasetIn, datasetIn[1:]):
   #Location 1
   try:
       hoursSum= nextRow['Hours']
    hoursSum=0
    except:
        return value
    #chunk of code here manipulating row

#Location 2
 try:
    hoursSum= nextRow['Hours']
    hoursSum=0
except:
     return value
#chunk of code here manipulating nextRow ending with a return of a new dataset

jlandwerlen · November 28, 2022, 3:13pm

I would fix that problem first

pascal.fragnoud · November 28, 2022, 3:14pm

https://docs.python.org/2/library/itertools.html#itertools.izip_longest

from itertools import izip_longest as zipl

foo = [1]
print [(row, next_row) for row, next_row in zipl(foo, foo[1:])]
foo = []
print [(row, next_row) for row, next_row in zipl(foo, foo[1:])]

[(1, None)]
[]

zacharyw.larson · November 28, 2022, 3:24pm

@jlandwerlen
There is nothing to "fix" about not having data over a long holiday.
If nothing runs for four days, there is no data for the last 3 days.

@pascal.fragnoud
That code from the Python site uses so many things I don't know that I can't follow its details about izip_longest.
I am not sure what you are saying.

Do you mean the for loop is running, and it returns an empty list when the input is empty?

jlandwerlen · November 28, 2022, 3:29pm

Good point, so your code needs to handle not having data.

pascal.fragnoud · November 28, 2022, 3:34pm

The code from the website is just here to show a possible implementation of izip_longest. It's supposed to help understand what the function does.

I'm not sure what your use case is, as the code you showed makes little to no sense to me, so this may not be pertinent to your issue.
If I knew what you're trying to do, I could maybe be more helpful, but right now I don't really see the issue you ran into...

I mean, you say you get an issue about an undefined nextRow, but I have no idea where and how you're using this variable. The only way to hit that with a variable defined this way (i.e. x defined with for x in something) is to use it outside of the loop, which clearly you shouldn't do.
Your edit does show a little more detail, but the indentation is messed up and there's no way this can ever run.
You're also nullifying an assignment (hoursSum = nextRow['Hours']) by immediately setting it to 0 afterward.

zacharyw.larson · November 28, 2022, 3:42pm

I edited the top post as I figured a way to catch the error.
Thanks for helping me and showing me that doc page.

I figured the script purpose didn't matter much.
I think "an error of calling a variable before it is defined" might be so common that there is a common solution.

pascal.fragnoud · November 28, 2022, 3:48pm

There is a common solution: Don't do it.
Now, how not to do it depends on how it was done. Which is why seeing your code would be helpful.
I feel a try/except is probably not the best solution to this problem.

lrose · November 28, 2022, 3:49pm

There is, don't do it.

The problem here is that you are trying to index an empty list. You should make sure that there is data in the list prior to trying to access it if there is a chance that the list is empty.

# assuming that datasetIn is a PyDataSet
if datasetIn.rowCount:
    for row, nextRow in zip(datasetIn, datasetIn[1:]):
        pass  # work with row and nextRow here

zacharyw.larson · November 28, 2022, 3:52pm

Thanks

If I want to avoid running my code when there are 0 or 1 rows, is this a good technique?

if datasetIn.rowCount>1:

pascal.fragnoud · November 28, 2022, 3:54pm

If what you want is not to run something on a dataset that has fewer than 2 rows, then yes.
But if there's a reason why you don't want to run something on this dataset, then...

zacharyw.larson · November 28, 2022, 4:30pm

On that doc page, is that the function definition for that library?
I thought it was an example, but it was so complicated.

It must be a complicated example.

from itertools import chain, repeat

class ZipExhausted(Exception):
    pass

def izip_longest(*args, **kwds):
    # izip_longest('ABCD', 'xy', fillvalue='-') --> Ax By C- D-
    fillvalue = kwds.get('fillvalue')
    counter = [len(args) - 1]
    def sentinel():
        if not counter[0]:
            raise ZipExhausted
        counter[0] -= 1
        yield fillvalue
    fillers = repeat(fillvalue)
    iterators = [chain(it, sentinel(), fillers) for it in args]
    try:
        while iterators:
            yield tuple(map(next, iterators))
    except ZipExhausted:
        pass

I don't know:
ZipExhausted
raise
yield
repeat
chain

I will look them up.

lrose · November 28, 2022, 5:00pm

The zip() function by itself will truncate a longer list to the length of the shortest.

So

list1 = ['A','B','C','D']
list2 = ['x','y']

for i,j in zip(list1,list2):
    print i,j

Will yield

A x
B y

Whereas this:

from itertools import izip_longest as zipl

list1 = ['A','B','C','D']
list2 = ['x','y']

for i,j in zipl(list1,list2):
    print i,j

Will yield:

A x
B y
C None
D None

raise will "raise" an exception. In this case they have defined a custom exception, ZipExhausted.

yield is used to "return" the next value from a generator.

repeat and chain are other functions in the itertools module.
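
If it helps, here is a rough sketch of those four pieces on their own, outside of izip_longest. The names OutOfItems, countdown, and first_item are made up for illustration and are not from the docs page. It should run unchanged on Jython 2.7 and Python 3.

```python
from itertools import chain, islice, repeat


class OutOfItems(Exception):
    """A custom exception, defined the same way as ZipExhausted."""
    pass


def countdown(start):
    # A generator: each 'yield' hands one value to the caller and pauses here.
    while start > 0:
        yield start
        start -= 1


def first_item(items):
    # 'raise' signals an error to whoever called this function.
    if not items:
        raise OutOfItems("no items to read")
    return items[0]


# chain() glues iterables end to end; repeat() produces the same value forever,
# so islice() is used to take only the first five results.
padded = list(islice(chain(['x', 'y'], repeat('-')), 5))

# Collect everything the generator yields.
counted = list(countdown(3))

# Catch the custom exception raised for an empty list.
try:
    first_item([])
    caught = False
except OutOfItems:
    caught = True
```

Here padded ends up as the two real items followed by three fill values, which is the same padding idea izip_longest uses for the shorter list.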

bkarabinchak.psi · November 28, 2022, 5:21pm

I feel like this has become overly complicated.

I don't see your full code but it seems like you need to look at your current data and your next row's data to do whatever it is you need to do. Is that correct?

I don't really use zip. I am sure there is a way to do it with zip, but it would confuse my coworkers, so here's how I would do it with enumerate:

l = [1,2,3,4]

for row, item in enumerate(l):
    try:
        curItem = item
        nextItem = l[row+1]
        # Do something with both rows' data
    except IndexError:
        # You are at the last item in the list, so you can stop now
        break

Is there some other condition you have that makes this method unsuitable for you?

lrose · November 28, 2022, 5:26pm

The real issue of the thread was trying to index an empty list.

The rest is just in response to @pascal.fragnoud recommending the izip_longest function, and to the example code using some built-ins and functions without much context as to why they're used.

zip() works just fine for what he is trying to accomplish, so long as there is actually data in the dataset.

bkarabinchak.psi · November 28, 2022, 5:35pm

Oh I see, so sometimes datasetIn had zero rows, and zip(datasetIn, datasetIn[1:]) was causing the error because datasetIn[1:] throws an IndexError. The zip(datasetIn, datasetIn[1:]) approach makes sense to me now, pretty nifty.

Yeah, it seems the easiest way would be to check the dataset length first and only run it if there's enough data present, or do something else if there is not.

@zacharyw.larson I would not do a deep dive into itertools and all the other python generator keywords, I would just do your if datasetIn.rowCount>1: check prior to running your script.
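
A rough sketch of that check, using a plain list of dicts in place of the dataset. pair_rows is a made-up helper, and the 'Hours' values are invented sample data. With a real dataset the guard would be the datasetIn.rowCount>1 test from above.

```python
def pair_rows(rows):
    """Return (row, nextRow) pairs, or an empty list when there are fewer than 2 rows."""
    # Hypothetical helper; 'rows' stands in for datasetIn as a plain list of dicts.
    if len(rows) < 2:
        # Not enough data to compare a row with its successor, so bail out early.
        return []
    return list(zip(rows, rows[1:]))


holiday = []  # no data after the long holiday
normal = [{'Hours': 8}, {'Hours': 6}, {'Hours': 7}]

holiday_pairs = pair_rows(holiday)
normal_pairs = pair_rows(normal)
```

The early return is also the natural place to hand back an empty or default dataset instead of letting later code touch names that were never set.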

pturmel · November 28, 2022, 5:50pm

The technique suggests the source SQL needs a LEAD() function for the columns of interest, which will automatically deliver a null in the last row and greatly simplify the Jython. (The DB will almost certainly be faster, too.)
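
As a sketch of the idea only: the snippet below uses Python's sqlite3 module with an in-memory table so it can run standalone. The table name, column names, and values are invented, and it assumes SQLite 3.25 or newer (the release that added window functions). The exact syntax in your own database may differ.

```python
import sqlite3

# In-memory table purely for illustration; table and column names are made up.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE runs (t_stamp INTEGER, hours REAL)")
conn.executemany("INSERT INTO runs VALUES (?, ?)", [(1, 8.0), (2, 6.0), (3, 7.0)])

# LEAD(hours) returns the 'hours' value of the following row, or NULL on the last row.
query = """
    SELECT t_stamp, hours, LEAD(hours) OVER (ORDER BY t_stamp) AS next_hours
    FROM runs
    ORDER BY t_stamp
"""
result = conn.execute(query).fetchall()
conn.close()
```

Each returned row already carries its successor's value, so the script needs no zip or slicing, and an empty table simply returns no rows.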

zacharyw.larson · November 28, 2022, 6:04pm

I liked learning about the keywords on w3schools.

I hadn't realized there were so many more that I didn't know about.

pascal.fragnoud · November 28, 2022, 7:18pm

The function is not an example of its use, it's what its implementation might look like - though it's probably much more complex than that - to give you a better understanding of what it does.

But I suggested that function because I misunderstood your problem. You probably don't need it in this case.

pascal.fragnoud · November 28, 2022, 7:26pm

'Slicing' an empty list should not raise errors. It just returns an empty list.

Which is used in some... not so pretty techniques:

some_list = ["foo", "bar", "baz"]
x = (some_list[1:2] or ["not found"])[0]
# x == "bar"

some_list = []
x = (some_list[1:2] or ["not found"])[0]
# x == "not found"

and things like that. Don't do it :X
It's simpler and clearer to just check if len(some_list) > the_index_you_want_to_use

The error was/is probably caused by a use of the variable declared in the loop definition outside of the loop itself, something like:

for x in []:
    pass
print x
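
To make that concrete, here is a small sketch with plain lists standing in for the dataset. The function names are made up. It shows that the slice is harmless, that the loop variable is never bound when the loop body does not run, and one way to avoid that by binding the name first.

```python
def last_next_row(rows):
    for row, next_row in zip(rows, rows[1:]):
        pass
    # next_row only exists if the loop body ran at least once
    return next_row


def last_next_row_safe(rows, default=None):
    next_row = default  # bind the name before the loop
    for row, next_row in zip(rows, rows[1:]):
        pass
    return next_row


empty = []
sliced = empty[1:]  # slicing an empty list is fine, it just gives []

try:
    last_next_row(empty)
    failed = False
except NameError:  # UnboundLocalError is a subclass of NameError
    failed = True

safe_value = last_next_row_safe(empty)
```

Checking the length up front is still the clearer option. The default binding is only worth it when you really do want a fallback value.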