
Word2Vec / Doc2Vec training fails: Supplied example count (0) did not equal expected count

I am learning Word2Vec and was trying to replicate a Word2Vec model from my textbook. Unlike what the textbook shows, however, my model gives a warning saying that supplied example count (0) did not equal expected count (2381). Apparently, my model was not trained at all. The corpus I fed to the model was apparently a re-usable iterator (it was a list), as it passed this test:

>>> print(sum(1 for _ in corpus))
>>> print(sum(1 for _ in corpus))
>>> print(sum(1 for _ in corpus))

2381
2381
2381

I tried with gensim 3.6 and gensim 4.3, and both versions gave me the same warning. Here is a code snippet I used with gensim 3.6:

word2vec_model = Word2Vec(size = 300, window=5, min_count = 2, workers = -1)
word2vec_model.build_vocab(corpus)
word2vec_model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin.gz', lockf=1.0, binary=True)
word2vec_model.train(corpus, total_examples = word2vec_model.corpus_count, epochs = 15)

This is the warning message:

WARNING:gensim.models.base_any2vec:EPOCH - 1 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 2 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 3 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 4 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 5 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 6 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 7 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 8 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 9 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 10 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 11 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 12 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 13 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 14 : supplied example count (0) did not equal expected count (2381)
WARNING:gensim.models.base_any2vec:EPOCH - 15 : supplied example count (0) did not equal expected count (2381)
(0, 0)

I also tried to train a different Doc2Vec model with a different corpus in the form of TaggedDocument objects, and it gave me the same warning message.

Gensim's Word2Vec & Doc2Vec (& related models) don't take a workers=-1 value. Gensim 的Word2VecDoc2Vec (及相关模型)不采用workers=-1值。 You have to set a specific count of worker threads.您必须设置特定数量的工作线程。

Setting -1 means no threads, which results in the no-training situation you've observed. (There might be some better messaging of what's gone wrong in the latest Gensim, or with logging at least at the INFO level.)
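As a minimal sketch, here is the same training sequence with an explicit positive worker count (4 is just an illustrative choice; pick a value no higher than your CPU core count). It keeps the gensim 3.6 parameter names from the snippet above and omits the intersect_word2vec_format() step:

word2vec_model = Word2Vec(size=300, window=5, min_count=2, workers=4)  # explicit thread count instead of -1
word2vec_model.build_vocab(corpus)
word2vec_model.train(corpus, total_examples=word2vec_model.corpus_count, epochs=15)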

Generally the worker count should never be higher than the number of CPU cores. Also, when training using a corpus iterable on a machine with more than 8 cores, optimal throughput is more likely to be reached in the 6-12 thread range than anything higher, because of contention/bottlenecking in the single-reader-thread, fan-out-to-many-workers approach Gensim uses, and the Python "GIL".

Unfortunately, the exact best throughput value will vary based on your other parameters, especially window, vector_size, and negative, and can only be found via trial-and-error. I often start with 6 on an 8-core machine, and 12 on any machine with 16 or more cores. (Another key tip is to make sure your corpus iterable is doing as little as possible in the main thread, such as reading a pre-tokenized file from disk rather than repeating other preprocessing on every iteration.)
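For example, a re-iterable corpus that does nothing but read an already-tokenized file (one text per line, whitespace-separated tokens; the file name here is hypothetical) might look like this:

class TokenizedLineCorpus:
    """Re-iterable corpus: each line of the file is one pre-tokenized text."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding='utf-8') as fh:
            for line in fh:
                yield line.split()  # tokens are assumed to be whitespace-separated

corpus = TokenizedLineCorpus('tokenized_corpus.txt')  # hypothetical file name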

If you can get all your text from a pre-tokenized text file, you can also consider the corpus_file mode, which lets each worker read its own unique range of the file, and thus better achieves maximum throughput when workers is set to the number of cores.
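A rough sketch of corpus_file mode, assuming gensim 4.x parameter names (vector_size and epochs rather than size and iter) and a hypothetical pre-tokenized file in LineSentence format (one text per line, space-separated tokens):

from gensim.models import Word2Vec

model = Word2Vec(
    corpus_file='tokenized_corpus.txt',  # hypothetical path to a pre-tokenized file
    vector_size=300,
    window=5,
    min_count=5,
    workers=8,  # with corpus_file, workers can usefully match the core count
    epochs=15,
)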

Separate tips:

  • A min_count=2 value so low usually hurts word2vec results: rare words don't learn good representations for themselves from a small number of usage examples, but can in aggregate dilute/interfere with other words. Discarding more rare words, as the size of the corpus allows, often improves all surviving words enough to improve overall downstream evaluations.

  • .intersect_word2vec_format() is an advanced/experimental option with no sure best practices; try to understand what it does from the source code, and the weird ways it changes the usual SGD tradeoffs, before trying it, and be sure to run extra checks that it's doing what you want compared to more typical approaches.
