
How do I get at the contents of an iterator?

I'm thoroughly puzzled. I have a block of HTML that I scraped out of a larger table. It looks roughly like this:

<td align="left" class="page">Number:\xc2\xa0<a class="topmenu" href="http://www.example.com/whatever.asp?search=724461">724461</a> Date:\xc2\xa01/1/1999 Amount:\xc2\xa0$2.50 <br/>Person:<br/><a class="topmenu" href="http://www.example.com/whatever.asp?search=LAST&amp;searchfn=FIRST">LAST,\xc2\xa0FIRST </a> </td>

(Actually, it looked worse, but I regexed out a lot of line breaks.)

I need to get the lines out, and break up the Date/Amount line. It seemed like the place to start was to find the children of that block of HTML. The block is a string because that's how regex gave it back to me. So I did:

text_soup = BeautifulSoup(text)
text_children = text_soup.find('td').childGenerator()

I've worked out that I can only iterate through text_children once, though I don't understand why that is. It's a listiterator type, which I'm struggling to understand.

I'm used to being able to assume that if I can iterate through something with a for loop, I can call on any one element with something like text_children[0]. That doesn't seem to be the case with an iterator. If I create a list with:

my_array = ["one","two","three"] 

I can use my_array[1] to see the second item in the array. If I try to do text_children[1] I get an error:

TypeError: 'listiterator' object is not subscriptable

How do I get at the contents of an iterator?

You can easily construct a list from the iterator with:

my_list = list(your_generator)

Now you can subscript the elements:

print(my_list[1])
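A minimal sketch of that conversion, using a plain list iterator as a stand-in for text_children (the names children and child_list are illustrative and not from the question):

```python
# Stand-in for text_children: any one-shot iterator behaves the same way.
children = iter(["Number:", "Date:", "Amount:"])

# Materialize the iterator once; the resulting list can be indexed and re-read.
child_list = list(children)

second = child_list[1]
print(second)
print(len(child_list))
```

The trade-off is that every element is now held in memory at once, which is usually fine for the handful of children in a single table cell.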

Another way to get a value is by using next. This will pull the next value from the iterator, but as you've already discovered, once you pull a value out of the iterator, you can't always put it back in (whether or not you can put it back in depends entirely on the object that is being iterated over and what its next method actually looks like).
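As a rough illustration of pulling values with the built-in next(), assuming a small list iterator in place of the real one (all variable names here are made up for the example):

```python
numbers = iter([10, 20, 30])

first = next(numbers)             # pulls 10 out of the iterator
rest = list(numbers)              # only the values not yet consumed remain
after_end = next(numbers, None)   # a default avoids StopIteration once exhausted

print(first, rest, after_end)
```

Without the second argument, calling next() on an exhausted iterator raises StopIteration.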

The reason for this is that often you just want an object that you can iterate over. Iterators are great for that, as they calculate the elements one at a time rather than needing to store all of the values. In other words, only one element from the iterator consumes your system's memory at a time, vs. a list or a tuple, where all of the elements are typically stored in memory before you start iterating.
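One way to see the memory point is a generator that never ends: it could not be turned into a list, yet a single element can still be fetched on demand. This is only a sketch; squares is a hypothetical helper, and itertools.islice is used here as one possible way to reach a given position.

```python
from itertools import islice

def squares():
    """Yield 0, 1, 4, 9, ... forever, computing each value on demand."""
    n = 0
    while True:
        yield n * n
        n += 1

# Skip the first three values and take the fourth; nothing else is stored.
fourth = next(islice(squares(), 3, None))
print(fourth)
```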

I'll try to work out a more general answer:

  • An iterable is an object which can be iterated over. These include lists, tuples, etc. On request, they give an iterator.

  • An iterator is an object which is used for iteration. It gives a value on each request, and once it is exhausted, it stays exhausted. These are generators, list iterators etc., but also e.g. file objects. Every iterator is iterable and gives itself as its iterator.

Example:

a = []
b = iter(a)
print a, b # -> [] <listiterator object at ...>
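The two bullet points can be checked directly. The sketch below is written for Python 3 (where the type is named list_iterator rather than listiterator), and the variable names are illustrative:

```python
items = ["one", "two", "three"]   # an iterable
it = iter(items)                  # an iterator over it

# Each request to the iterable hands out a new, independent iterator.
fresh_each_time = iter(items) is not iter(items)

# An iterator, asked for its iterator, returns itself.
returns_itself = iter(it) is it

print(fresh_each_time, returns_itself)
```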

If you do

for i in a: ...

a is asked for an iterator via its __iter__() method, and this iterator is then queried for the next elements until exhausted. This happens via the .next() method (__next__() in 3.x).
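Spelled out by hand, the loop described above looks roughly like the following. It is an approximation of what the for statement does, not the interpreter's actual implementation, and seen is just a name chosen for the example:

```python
a = ["one", "two", "three"]

seen = []
it = iter(a)              # what the for statement does first: calls a.__iter__()
while True:
    try:
        item = next(it)   # calls it.__next__() (it.next() on Python 2)
    except StopIteration:
        break             # the iterator is exhausted, so the loop ends
    seen.append(item)     # stands in for the loop body

print(seen)
```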

Indexing is a completely different thing. Because iteration can fall back to indexing if the object doesn't have an .__iter__() method, every sequence-style indexable object (one that accepts integer indices starting from 0) is iterable, but not vice versa.
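Both directions of that statement can be tried out. In this sketch, Countdown is a hypothetical class that only defines __getitem__, and a generator expression serves as the iterable that cannot be indexed:

```python
class Countdown:
    """Hypothetical class: indexable, but defines no __iter__ method."""

    def __init__(self, start):
        self.start = start

    def __getitem__(self, index):
        # Raising IndexError is what tells the for loop to stop.
        if index < 0 or index > self.start:
            raise IndexError(index)
        return self.start - index

# Iteration works by trying indices 0, 1, 2, ... until IndexError.
values = [n for n in Countdown(3)]

# The reverse does not hold: a generator is iterable but not indexable.
try:
    (n for n in range(3))[0]
    generator_is_indexable = True
except TypeError:
    generator_is_indexable = False

print(values, generator_is_indexable)
```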

The short answer, as stated before me, is to just create a list from your generator,

like so: list(generator)

The long answer, and the explanation as to why:

When you create a generator, or in your case a 'listiterator' (the iterator that Beautiful Soup hands back here), you are not really creating a list of items. You are creating an object that knows how to step through a certain number of items, one at a time, via next().

What that means:

Instead of what you want, which is, let's say, a book with pages,

you get a typewriter.

The typewriter can create a book with pages, but only one page at a time. Now, if you just start at the beginning and look at them one at a time, as a for loop does, then yes, it's almost like reading a normal book.

But unlike a normal book, once the typewriter is finished with a page, you can't go backwards; that page is now gone.
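In code, the typewriter analogy might look like this. It is a sketch only: pages and book are names invented for the illustration, and a generator expression plays the part of the typewriter.

```python
pages = (f"page {n}" for n in range(1, 4))   # the "typewriter"

first_read = list(pages)    # reads every page once, front to back
second_read = list(pages)   # nothing left: the generator is exhausted

book = first_read           # the "book": a list you can reopen at any page
print(book[0], second_read)
```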

I hope this makes some sense.

