
Batch-train word2vec in gensim with support of multiple workers

Context

There exist several questions about how to train Word2Vec with gensim on streamed data. However, those questions don't address the issue that streaming cannot use multiple workers, since there is no array to split between threads.

Hence I wanted to create a generator providing such functionality for gensim. My result looks like this:

from gensim.models import Word2Vec as w2v

#The data is stored in a Python list, unsplit.
#It's too much data to store it pre-split, so the split has to happen while streaming.
data = ['this is document one', 'this is document two', ...]

#Now the generator class
import threading
import numpy as np  #needed for np.ceil in __len__

class dataGenerator:
    """
    Generator for batch-tokenization.
    """

    def __init__(self, data: list, batch_size:int = 40):
        """Initialize generator and pass data."""

        self.data = data
        self.batch_size = batch_size
        self.lock = threading.Lock()


    def __len__(self):
        """Get total number of batches."""
        return int(np.ceil(len(self.data) / float(self.batch_size)))


    def __iter__(self):
        """
        Iterator wrapper for generator functionality (since generators cannot be reused directly).
        Allows for data streaming.
        """
        for idx in range(len(self)):
            yield self[idx]


    def __getitem__(self, idx):
        """Return the current batch by slicing the data."""

        #Make access thread-safe when multiple workers read concurrently
        with self.lock:
            return [arr.split(" ") for arr in self.data[idx * self.batch_size : (idx + 1) * self.batch_size]]


#And now run the training
model = w2v(
             sentences=dataGenerator(data),
             size=300,      #renamed to `vector_size` in gensim 4.0
             window=5,
             min_count=1,
             workers=4
            )

This results in the error

TypeError: unhashable type: 'list'

Since dataGenerator(data) would work if I just yielded a single split document, I assume that gensim's word2vec wraps the generator output in an extra list. In this case the __iter__ would look like:

def __iter__(self) -> list:
    """
    Iterator wrapper for generator functionality (since generators cannot be reused directly).
    Allows for data streaming.
    """
    for text in self.data:
        yield text.split(" ")

Hence, my batch would also be wrapped, resulting in something like [[['this', '...'], ['this', '...']], [[...], [...]]] (=> a list of lists of lists), which cannot be processed by gensim.
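That extra level of nesting can be reproduced without gensim; a minimal sketch (plain Python, function names are illustrative) of what each iterator variant yields:

```python
# Minimal sketch of the nesting problem, without gensim.
data = ["this is document one", "this is document two"]

def per_document(docs):
    # Yields one tokenized document at a time -> each item is a list of str.
    for text in docs:
        yield text.split(" ")

def per_batch(docs, batch_size=2):
    # Yields a batch of tokenized documents at a time -> each item is a
    # list of lists of str, one level too deep for gensim's sentences input.
    for i in range(0, len(docs), batch_size):
        yield [text.split(" ") for text in docs[i:i + batch_size]]

flat = list(per_document(data))
batched = list(per_batch(data))

print(flat[0])     # ['this', 'is', 'document', 'one']
print(batched[0])  # a whole batch: two tokenized documents in one yielded item
```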




My question:

Can I "stream"-pass batches in order to use multiple workers? How can I change my code accordingly?

It seems I was too impatient. I ran the streaming function written above, which processes only one document instead of a batch:

def __iter__(self) -> list:
    """
    Iterator wrapper for generator functionality (since generators cannot be reused directly).
    Allows for data streaming.
    """
    for text in self.data:
        yield text.split(" ")

After starting the w2v function it took about ten minutes until all cores were working correctly.

It seems that building the vocabulary does not support multiple cores, and hence only one was used for this task. Presumably it took so long because of the corpus size. After gensim built the vocab, all cores were used for the training.

So if you are running into this issue as well, maybe some patience will already help :)

Just want to reiterate that @gojomo's comment is the way to go: with a large corpus and multiple CPUs, it's much faster to train gensim word2vec using the corpus_file parameter instead of sentences, as mentioned in the docs:

  • corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get a performance boost. Only one of the sentences or corpus_file arguments needs to be passed (or neither of them; in that case, the model is left uninitialized).

LineSentence format is basically just one sentence per line, with words space-separated. Plain text, .bz2 or .gz.
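As a sketch of that alternative (the file path and tiny corpus are placeholders): write the in-memory data to disk in LineSentence format, then hand the path to Word2Vec via corpus_file. The training call is guarded because parameter names differ between gensim versions (size was renamed vector_size in gensim 4.0):

```python
# Write the in-memory corpus to a LineSentence-format file:
# plain text, one sentence per line, tokens separated by single spaces.
data = ["this is document one", "this is document two"]  # placeholder corpus

corpus_path = "corpus.txt"
with open(corpus_path, "w", encoding="utf-8") as f:
    for sentence in data:
        f.write(sentence + "\n")

# Training from a file lets gensim parallelize far better than a Python
# iterable can, since workers read the file independently.
try:
    from gensim.models import Word2Vec
    model = Word2Vec(corpus_file=corpus_path, window=5, min_count=1, workers=4)
except ImportError:
    pass  # gensim not installed; the file above is still valid LineSentence input
```

The file also works compressed (.bz2 or .gz), as the docs note.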
