
Generator is not an iterator?

I have a generator (a function that yields stuff), but when I try to pass it to gensim.Word2Vec I get the following error:

TypeError: You can't pass a generator as the sentences argument. Try an iterator.

Isn't a generator a kind of iterator? If not, how do I make an iterator from it?

Looking at the library code, it seems to simply iterate over sentences with something like for x in enumerate(sentences), which works just fine with my generator. What is causing the error, then?

A generator is exhausted after one loop over it. Word2Vec needs to traverse the sentences multiple times (and probably to get the item at a given index, which is not possible for generators, since they behave like a stack from which you can only pop), so it requires something more solid, like a list.

In particular, their code calls two different functions that both iterate over sentences (so if you pass a generator, the second one would run on an empty sequence):

self.build_vocab(sentences, trim_rule=trim_rule)
self.train(sentences)

It should work with anything that implements __iter__ and is not a GeneratorType. So wrap your function in an iterable interface and make sure that you can traverse it multiple times, meaning that

sentences = your_code
for s in sentences:
  print(s)
for s in sentences:
  print(s)

prints your collection twice.
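One way to meet that requirement, sketched here on the assumption that your generator function can simply be called again to restart the data, is a small wrapper class whose __iter__ calls the function anew for every loop. The names RestartableSentences and my_generator are hypothetical and not part of gensim.

```python
import types


def my_generator():
    """Hypothetical generator function yielding tokenized sentences."""
    yield ["hello", "world"]
    yield ["generators", "are", "single", "use"]


class RestartableSentences:
    """Iterable wrapper: every for-loop gets a fresh generator."""

    def __init__(self, generator_func):
        self.generator_func = generator_func

    def __iter__(self):
        # Calling the function again restarts the stream from the beginning.
        return self.generator_func()


sentences = RestartableSentences(my_generator)

first_pass = list(sentences)
second_pass = list(sentences)  # works again, unlike a bare generator

# The wrapper itself is a plain object, not a generator.
is_generator = isinstance(sentences, types.GeneratorType)
```

Because the wrapper holds the function rather than a generator object, nothing is kept in memory between passes.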

As previous posters have mentioned, a generator can be looped over like any other iterable, with two significant differences: a generator gets exhausted after one pass, and you can't index it.
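Both differences are easy to see in a few lines of plain Python. This is only an illustration with a hypothetical generator function named numbers; it does not involve gensim.

```python
def numbers():
    """Hypothetical generator function."""
    yield 1
    yield 2
    yield 3


gen = numbers()
first = list(gen)   # consumes every item
second = list(gen)  # the generator is now exhausted, so nothing is left

try:
    numbers()[0]
    indexable = True
except TypeError:
    # Generator objects do not implement __getitem__.
    indexable = False
```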

I quickly looked up the documentation on this page: https://radimrehurek.com/gensim/models/word2vec.html

The documentation states that

gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0, seed=1, workers=1, min_alpha=0.0001, sg=1, hs=1, negative=0, cbow_mean=0, hashfxn=, iter=1, null_word=0, trim_rule=None, sorted_vocab=1) ...

Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.

I'd venture to guess that the logic inside the function inherently requires one or more list properties, such as item indexing; there might be an explicit assert statement or an if statement that raises the error.

A simple hack that can solve your problem is turning your generator into a list with a list comprehension. Your program will incur a CPU performance penalty and use more memory, but this should at least make the code work.

my_iterator = [x for x in generator_obj]

It seems gensim throws a misleading error message.

Gensim wants to iterate over your data multiple times. Most libraries just build a list from the input, so the user doesn't have to care about supplying a sequence that can be iterated multiple times. Of course, building an in-memory list can be very resource-consuming, whereas iterating over a file, for example, can be done without storing the whole file in memory.
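To illustrate the file case, here is a minimal sketch of an iterable that reopens a text file on every pass, so it can be traversed repeatedly while holding only one line at a time. The class name LineSentences, the one-sentence-per-line layout, and whitespace tokenization are all assumptions made for the example.

```python
import os
import tempfile


class LineSentences:
    """Hypothetical iterable that streams whitespace-tokenized lines."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on each pass; only the current line is in memory.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


# Small demo corpus written to a temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the quick brown fox\njumps over the lazy dog\n")
    corpus_path = tmp.name

sentences = LineSentences(corpus_path)
pass_one = list(sentences)
pass_two = list(sentences)  # a second full pass over the same file

os.remove(corpus_path)  # clean up the demo file
```

Each call to __iter__ returns a fresh generator, but the object you pass around is an ordinary class instance rather than a generator.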

In your case, just changing the generator to a list comprehension should solve the problem.

Other answers have pointed out that Gensim requires two passes to build the Word2Vec model: one to build the vocabulary (self.build_vocab) and a second to train the model (self.train). You can still pass a generator to the train method (e.g., if you're streaming data) by calling build_vocab and train separately.

from gensim.models import Word2Vec

model = Word2Vec()
sentences = my_generator()  # first pass
model.build_vocab(sentences)

sentences = my_generator()  # second pass of same data
model.train(sentences,
            total_examples=num_sentences,  # total number of documents to process
            epochs=model.epochs)
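The snippet above leaves num_sentences undefined. A minimal sketch of one way to obtain it, assuming the generator can be recreated cheaply, is an extra counting pass that stores nothing. The body of my_generator below is a hypothetical stand-in for your streaming source. Gensim also appears to record the number of sentences seen by build_vocab as model.corpus_count, but check the documentation for your version before relying on it.

```python
def my_generator():
    """Hypothetical stand-in for the streaming generator used above."""
    for text in ("first sentence here", "second sentence", "and a third one"):
        yield text.split()


# One extra pass over a fresh generator, counting items without storing them.
num_sentences = sum(1 for _ in my_generator())
```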
