
How do I get at the contents of an iterator?

I'm thoroughly puzzled. I have a block of HTML that I scraped out of a larger table. It looks about like this:

<td align="left" class="page">Number:\xc2\xa0<a class="topmenu" href="http://www.example.com/whatever.asp?search=724461">724461</a> Date:\xc2\xa01/1/1999 Amount:\xc2\xa0$2.50 <br/>Person:<br/><a class="topmenu" href="http://www.example.com/whatever.asp?search=LAST&amp;searchfn=FIRST">LAST,\xc2\xa0FIRST </a> </td>

(Actually, it looked worse, but I regexed out a lot of line breaks)

I need to get the lines out, and break up the Date/Amount line. It seemed like the place to start was to find the children of that block of HTML. The block is a string because that's how regex gave it back to me. So I did:

text_soup = BeautifulSoup(text)
text_children = text_soup.find('td').childGenerator()

I've worked out that I can only iterate through text_children once, though I don't understand why that is. It's a listiterator type, which I'm struggling to understand.

I'm used to being able to assume that if I can iterate through something with a for loop I can call on any one element with something like text_children[0]. That doesn't seem to be the case with an iterator. If I create a list with:

my_array = ["one","two","three"] 

I can use my_array[1] to see the second item in the array. If I try to do text_children[1] I get an error:

TypeError: 'listiterator' object is not subscriptable

How do I get at the contents of an iterator?

You can easily construct a list from the iterator:

my_list = list(your_generator)

Now you can subscript the elements:

print(my_list[1])
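As a rough sketch of the difference, here is the same idea with a plain list iterator standing in for text_children (the Beautiful Soup object itself isn't reproduced here, and the sample strings are made up):

```python
# Stand-in for text_children: any iterator behaves the same way here.
children = iter(["Number:", "Date:", "Amount:"])

# Subscripting the iterator directly fails with a TypeError.
try:
    children[1]
    subscript_worked = True
except TypeError:
    subscript_worked = False

# Copy the remaining items into a list; the list supports indexing.
my_list = list(children)
second = my_list[1]

# The iterator is now used up, so a second list() call gets nothing.
leftover = list(children)
```

Building the list consumes the iterator, so keep the list around and work with that from then on.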

Another way to get a value is to call next(). This pulls the next value from the iterator, but as you've already discovered, once you pull a value out of an iterator you can't always put it back in (whether you can depends entirely on the object being iterated over and on what its next method actually looks like).
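A small illustration of pulling values with next(), assuming an ordinary list iterator (the numbers are placeholders):

```python
numbers = iter([10, 20, 30])

first = next(numbers)    # takes the first value
second = next(numbers)   # takes the second value
rest = list(numbers)     # only what has not been consumed yet

# Once the iterator is exhausted, next() raises StopIteration
# unless a default value is supplied as the second argument.
fallback = next(numbers, None)

try:
    next(numbers)
    raised = False
except StopIteration:
    raised = True
```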

The reason for this design is that often you just want an object that you can iterate over. Iterators are great for that because they produce the elements one at a time rather than needing to store all of the values. In other words, only one element from the iterator is consuming your system's memory at a time, whereas with a list or a tuple all of the elements are typically stored in memory before you start iterating.
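To make the memory point concrete, here is a hedged comparison of a list and a generator expression producing the same values. Note that sys.getsizeof only reports the size of the container object itself, not of the integers it refers to, so treat the numbers as a rough indication only:

```python
import sys

n = 100000
eager = [i * i for i in range(n)]   # every value is built up front
lazy = (i * i for i in range(n))    # values are computed on demand

# The list object grows with n; the generator object stays small.
eager_size = sys.getsizeof(eager)
lazy_size = sys.getsizeof(lazy)

# Both yield the same values when iterated once.
total_eager = sum(eager)
total_lazy = sum(lazy)
```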

I'll try to give a more general answer:

  • An iterable is an object which can be iterated over. These include lists, tuples, etc. On request, they give an iterator.

  • An iterator is an object which is used for iteration. It gives one value on each request, and once it is exhausted, it stays exhausted. Examples are generators and list iterators, but also e.g. file objects. Every iterator is itself iterable and returns itself as its iterator.

Example:

a = []
b = iter(a)
print(a, b)  # -> [] <list_iterator object at ...> (the type is named listiterator in Python 2)
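The two definitions can be checked directly. This is only a sketch on a small list with made-up values:

```python
a = [1, 2, 3]
b = iter(a)

# An iterator returns itself when asked for an iterator.
iterator_is_its_own_iterator = iter(b) is b

# An iterable such as a list hands out a new iterator on each request.
list_gives_new_iterators = iter(a) is not iter(a)

# So the list can be walked repeatedly, the iterator only once.
list_first_pass = list(a)
list_second_pass = list(a)
iter_first_pass = list(b)
iter_second_pass = list(b)
```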

If you do

for i in a: ...

a is asked for an iterator via its __iter__() method, and this iterator is then queried for the next element until it is exhausted. This happens via the .next() method in Python 2, or __next__() in 3.x.
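Roughly speaking, a for loop can be spelled out by hand as below. This is a Python 3 sketch, and manual_for is a hypothetical helper name used only for illustration:

```python
def manual_for(iterable):
    """Collect items the way a for loop would fetch them."""
    collected = []
    iterator = iter(iterable)        # normally calls iterable.__iter__()
    while True:
        try:
            item = next(iterator)    # calls iterator.__next__()
        except StopIteration:
            break                    # the iterator is exhausted
        collected.append(item)
    return collected


letters = manual_for("abc")
nothing = manual_for([])
```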

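Indexing is a completely different thing. If an object has no __iter__() method but does have __getitem__(), Python falls back to iterating by index, requesting items 0, 1, 2, ... until an IndexError is raised. So every object that is indexable in this sequence style is iterable, but not vice versa.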
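A minimal sketch of that fallback, using a made-up Squares class that defines __getitem__() but no __iter__():

```python
class Squares:
    """Indexable, but has no __iter__ (hypothetical example class)."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        if not 0 <= index < self.n:
            raise IndexError(index)   # tells the fallback iteration to stop
        return index * index


sq = Squares(4)
by_index = sq[2]
by_iteration = list(sq)   # works through the __getitem__ fallback

# The reverse does not hold: a generator is iterable but not indexable.
gen = (x for x in range(3))
try:
    gen[0]
    gen_indexable = True
except TypeError:
    gen_indexable = False
```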

The short answer, as stated before me, is to just create a list from your generator.

Like so: list(generator)

The long answer, and the explanation as to why:

When you create a generator, or in your case a 'listiterator' (the iterator Python hands out for a list, which is what Beautiful Soup returns here), you are not really creating a list of items. You are creating an object that knows how to step through a certain number of items, one at a time, via next().

What that means:

Instead of what you want, which is, let's say, a book with pages, you get a typewriter.

The typewriter can create a book with pages, but only one page at a time. If you just start at the beginning and look at the pages one at a time, as a for loop does, then yes, it's almost like reading a normal book.

But unlike a normal book, once the typewriter is finished with a page, you can't go backwards; that page is now gone.
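The typewriter picture could be sketched with a small generator function. The name typewriter and the page strings are invented for the illustration:

```python
def typewriter(pages):
    """Produce pages one at a time; nothing is stored."""
    for number in range(1, pages + 1):
        yield "page %d" % number


book = typewriter(3)
first_read = [page for page in book]    # reads every page once
second_read = [page for page in book]   # the pages are gone by now

# To flip back and forth, keep the pages in a list instead.
kept = list(typewriter(3))
last_then_first = (kept[-1], kept[0])
```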

I hope this makes some sense.

