Python 3.4 nested loop using lambda filter working weirdly

I am trying to use NLTK's texttiling code ( https://github.com/nltk/nltk/blob/develop/nltk/tokenize/texttiling.py ).

It segments an input document into several tiles based on its content. I noticed that for some documents tiling doesn't work at all: the entire text is returned as one tile. I traced this to the following portion of the code, which behaves strangely.

    depth_tuples = sorted(zip(depth_scores, range(len(depth_scores))))
    depth_tuples.reverse()
    hp = filter(lambda x:x[0]>cutoff, depth_tuples)

    for dt in hp:
        boundaries[dt[1]] = 1
        for dt2 in hp: #undo if there is a boundary close already
            if dt[1] != dt2[1] and abs(dt2[1]-dt[1]) < 4 \
                   and boundaries[dt2[1]] == 1:
                boundaries[dt[1]] = 0
    return boundaries

depth_tuples is a list of (score, index) tuples, and hp is the filtered result containing only the tuples whose score is greater than the cutoff value.

The nested loop is meant to iterate over hp in full once for each entry of hp. In other words, for each entry of hp, it should check something against every entry of hp. But I noticed that the second loop (for dt2 in hp) is not executed after the first iteration of the outer loop. It's as if the dt2 pointer reaches the end of hp during the first dt and is not reset for the next iteration.

To give a simplified example of this, say x = [(0.6, 3), (0.2, 1), (0.5, 2), (0.4, 3)].

If the cutoff is 0.3, hp contains [(0.6, 3), (0.5, 2), (0.4, 3)],

so the loops should go like this:

when dt = (0.6, 3), the second loop checks [(0.6, 3), (0.5, 2), (0.4, 3)]

when dt = (0.5, 2), the second loop again checks [(0.6, 3), (0.5, 2), (0.4, 3)]

But that only happens when dt = (0.6, 3); for the rest of hp, the second loop doesn't run.

I initially suspected that the iterator had reached the end of hp in the second loop, but that wouldn't explain how the first loop's iteration over hp can still continue...

Could you explain why this happens? Thanks!

You are using Python 3, and the recipe was written for Python 2. In Python 2, filter returns a list, which can be iterated over as many times as you like with for (including the inner for dt2 in hp).

However, in Python 3, hp will be a one-pass iterator; now, the outer for would consume the first element, and the inner for would consume all the remaining elements; when the inner loop exits, the outer loop finds an empty iterator and exits too.
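To make this concrete, here is a minimal sketch using the (score, index) tuples from the question's simplified example. The visits list is a name introduced here purely for illustration; it records which pairs the two loops actually reach under Python 3.

```python
# Data from the question's simplified example: (score, index) tuples,
# already sorted from highest to lowest score.
depth_tuples = [(0.6, 3), (0.5, 2), (0.4, 3), (0.2, 1)]
cutoff = 0.3

# In Python 3 this is a one-pass iterator, not a list.
hp = filter(lambda x: x[0] > cutoff, depth_tuples)

visits = []  # (outer item, inner item) pairs actually visited
for dt in hp:          # takes (0.6, 3) from the iterator
    for dt2 in hp:     # drains everything that is left
        visits.append((dt, dt2))

print(visits)
# [((0.6, 3), (0.5, 2)), ((0.6, 3), (0.4, 3))]
```

The outer loop body runs only once, and the inner loop never sees (0.6, 3) itself, because both loops pull from the same iterator.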

Or, as the Python 2 and 3 documentation says, in Python 2 filter(function, iterable) is equivalent to the list comprehension

    [item for item in iterable if function(item)]

and in Python 3, it is equivalent to the generator expression

    (item for item in iterable if function(item))

As the simplest fix, make the iterator returned by filter into a list:

    hp = list(filter(lambda x: x[0] > cutoff, depth_tuples))
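As a rough sketch of how the fix fits into the loop from the question, the snippet below wraps it in a standalone function. The function name mark_boundaries, the min_gap parameter, and the sample scores are assumptions made for illustration; they are not part of NLTK's API.

```python
def mark_boundaries(depth_scores, cutoff, min_gap=4):
    """Hypothetical standalone version of the loop from the question."""
    boundaries = [0] * len(depth_scores)
    # (score, index) pairs, highest score first.
    depth_tuples = sorted(zip(depth_scores, range(len(depth_scores))),
                          reverse=True)
    # list(...) makes hp a sequence, so the inner loop can restart each time.
    hp = list(filter(lambda x: x[0] > cutoff, depth_tuples))

    for dt in hp:
        boundaries[dt[1]] = 1
        for dt2 in hp:  # undo if there is a boundary close already
            if (dt[1] != dt2[1] and abs(dt2[1] - dt[1]) < min_gap
                    and boundaries[dt2[1]] == 1):
                boundaries[dt[1]] = 0
    return boundaries


# Made-up scores: indices 1 and 2 are close together, index 7 is far away.
scores = [0.1, 0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.7]
result = mark_boundaries(scores, cutoff=0.5)
print(result)
# [0, 1, 0, 0, 0, 0, 0, 1]
```

With a bare filter object in Python 3, the same input would mark only index 1, because the outer loop would stop after its first item.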

I don't know why Dan D. deleted his answer. Maybe it didn't completely explain the problem, but it did give the right solution and the crucial piece of information you're missing.

Assuming this is Python 3, filter returns an iterator, not a sequence. Iterators can only be iterated once. An iterator knows its "current position", and produces the values lazily as you ask for them; once you've asked for all the values, there are no more values to give. So, for example:

    >>> hp = iter([1,2,3])
    >>> for dt in hp:
    ...     print(dt)
    ...
    1
    2
    3
    >>> for dt in hp:
    ...     print(dt)
    ...

The second time, it prints nothing, because you've already used all the values.

And the same thing happens in a nested loop:

    >>> hp = iter([1,2,3])  # start again with a fresh iterator
    >>> for dt in hp:
    ...     print(dt)
    ...     for dt2 in hp:
    ...         print('>', dt2)
    ...
    1
    > 2
    > 3

In the first iteration through the outer loop, dt gets the first value. Then the nested inner loop gets all the rest of the values, so the outer loop is done.

If you want to iterate over something repeatedly, the simplest thing to do is to convert it to a sequence:

    hp = list(hp)

In some cases, it can be more efficient and/or conceptually cleaner to use itertools.tee, but that doesn't apply here. Your code is designed to treat hp as a sequence, so just make it a sequence.
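For comparison only, here is a minimal sketch of what tee does, using a made-up filter over a short list. It splits one iterator into independent copies, each of which can still be consumed only once.

```python
from itertools import tee

# A one-pass iterator, as filter returns in Python 3 (illustrative data).
hp = filter(lambda n: n > 1, [1, 2, 3])

# tee gives two independent iterators over the same values.
# The original hp should not be used again after this call.
first, second = tee(hp)

first_values = list(first)
second_values = list(second)
print(first_values)   # [2, 3]
print(second_values)  # [2, 3]
```

A nested loop like the one in the question would need a fresh copy for every outer iteration, which is why converting to a list once is the simpler choice there.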
