Batch-training word2vec in gensim with support for multiple workers

Question

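Context

There are several questions about how to train Word2Vec with gensim on streamed data. However, none of them addresses the issue that streaming cannot make use of multiple workers, since there is no array to split between threads.

I therefore wanted to write a generator that provides this functionality for gensim. My result looks like this: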

import threading

import numpy as np
from gensim.models import Word2Vec as w2v

# The data is stored in a Python list, unsplit.
# It is too much data to store pre-split, so the split has to happen while streaming.
data = ['this is document one', 'this is document two', ...]

# Now the generator class

class dataGenerator:
    """
    Generator for batch-tokenization.
    """

    def __init__(self, data: list, batch_size:int = 40):
        """Initialize generator and pass data."""

        self.data = data
        self.batch_size = batch_size
        self.lock = threading.Lock()


    def __len__(self):
        """Get total number of batches."""
        return int(np.ceil(len(self.data) / float(self.batch_size)))


    def __iter__(self) -> list:
        """
        Iterator-wrapper for generator-functionality (since generators cannot be used directly).
        Allows for data-streaming.
        """
        for idx in range(len(self)):
            yield self[idx]


    def __getitem__(self, idx):

        #Make multithreading thread-safe
        with self.lock:

            # Returns current batch by slicing data.
            return [arr.split(" ") for arr in self.data[idx * self.batch_size : (idx + 1) * self.batch_size]]


#And now do the training
model = w2v(
             sentences=dataGenerator(data),
             size=300,
             window=5,
             min_count=1,
             workers=4
            )

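This results in the error

TypeError: unhashable type: 'list'

Since dataGenerator(data) works if I yield only a single split document at a time, I assume that gensim's word2vec wraps the generator in an additional list. In that case, __iter__ looks like this: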

def __iter__(self) -> list:
    """
    Iterator-wrapper for generator-functionality (since generators cannot be used directly).
    Allows for data-streaming.
    """
    for text in self.data:
        yield text.split(" ")

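My batches therefore get wrapped as well, resulting in something like [[['this', '...'], ['this', '...']], [[...], [...]]] (=> a list of lists of lists), which gensim cannot process.

My question:

Can I "stream" batches in order to use multiple workers? How would I have to change my code accordingly?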
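
To illustrate the error described above: gensim expects `sentences` to be an iterable in which every item is a list of string tokens. The sketch below does not use gensim at all. `count_tokens` is a hypothetical, simplified stand-in for a vocabulary scan (not gensim's actual implementation), and `flatten_batches` is an equally hypothetical helper. It shows why yielding whole batches fails and how unwrapping them restores the expected shape.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_tokens(sentences: Iterable[list]) -> Counter:
    """Simplified stand-in for a vocabulary scan: count every token."""
    counts = Counter()
    for sentence in sentences:
        for token in sentence:
            # Tokens are used as dictionary keys, so they must be hashable.
            counts[token] += 1
    return counts


def flatten_batches(batches: Iterable[List[List[str]]]) -> Iterator[List[str]]:
    """Yield the sentences of each batch one by one."""
    for batch in batches:
        yield from batch


# One batch containing two tokenized documents (a list of lists of lists).
batches = [[["this", "is", "document", "one"], ["this", "is", "document", "two"]]]

error_message = ""
try:
    # Each "sentence" is a whole batch here, so each "token" is a list.
    count_tokens(batches)
except TypeError as exc:
    error_message = str(exc)

# After flattening, every item is a list of strings again.
counts = count_tokens(flatten_batches(batches))
```

With the batches flattened, the consumer sees exactly the same stream as when each document is yielded on its own, so batching at this level gains nothing.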

Answer 1

It seems I was too impatient. I ran the streaming function shown above, which processes one document at a time rather than a batch:

def __iter__(self) -> list:
    """
    Iterator-wrapper for generator-functionality (since generators cannot be used directly).
    Allows for data-streaming.
    """
    for text in self.data:
        yield text.split(" ")

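After starting the w2v call, it took about ten minutes until all cores were busy.

It seems that building the vocabulary does not support multiple cores, so only one core was used for that step. Presumably it took so long because of the size of the corpus. Once gensim had built the vocabulary, all cores were used for training.

So if you run into this issue as well, a bit of patience may help :)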
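
A minimal sketch of the per-document stream this answer relies on, written as a standalone class. The name `SentenceStream` is made up for illustration. The point it shows is that the object is a restartable iterable rather than a one-shot generator: gensim iterates over the corpus more than once (one pass to build the vocabulary, then one pass per training epoch), so every call to `__iter__` has to start again from the first document.

```python
from typing import Iterator, List


class SentenceStream:
    """Restartable iterable that tokenizes one document at a time."""

    def __init__(self, documents: List[str]) -> None:
        self.documents = documents

    def __iter__(self) -> Iterator[List[str]]:
        # A fresh generator is created on every call, so the stream can be
        # consumed repeatedly (vocabulary pass, then each training epoch).
        for text in self.documents:
            # split() without arguments also collapses repeated whitespace,
            # unlike the split(" ") used in the question.
            yield text.split()


stream = SentenceStream(["this is document one", "this is document two"])
first_pass = list(stream)
second_pass = list(stream)  # works again, unlike an exhausted generator
```

An instance of this class would be passed as `sentences=SentenceStream(data)` together with `workers=4`, as in the training call from the question.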

Answer 2

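Just want to reiterate that @gojomo's comment is the way to go: with a large corpus and multiple CPUs, it is much faster to train gensim word2vec using the corpus_file parameter instead of sentences, as mentioned in the docs:

corpus_file (str, optional) -- Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get a performance boost. Only one of the sentences or corpus_file arguments needs to be passed (or neither of them, in which case the model is left uninitialized).

The LineSentence format is basically just one sentence per line, with words separated by spaces. Plain text, .bz2 or gz.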
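
As a rough sketch of how the in-memory list from the question could be turned into such a file: the helper name `write_line_sentence_file` and the temporary file location are assumptions made for illustration, not part of gensim. The gensim call itself is only shown as a comment, since it requires gensim to be installed and its remaining parameters depend on the gensim version in use.

```python
import os
import tempfile
from typing import Iterable


def write_line_sentence_file(documents: Iterable[str], path: str) -> str:
    """Write one document per line, with tokens separated by single spaces."""
    with open(path, "w", encoding="utf-8") as handle:
        for text in documents:
            tokens = text.split()
            if tokens:  # skip empty documents instead of writing blank lines
                handle.write(" ".join(tokens) + "\n")
    return path


data = ["this is document one", "this  is document two"]
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
write_line_sentence_file(data, corpus_path)

# Read the file back to check the one-sentence-per-line layout.
with open(corpus_path, encoding="utf-8") as handle:
    written_lines = handle.read().splitlines()

# Training would then point gensim at the file instead of an iterable:
# from gensim.models import Word2Vec
# model = Word2Vec(corpus_file=corpus_path, min_count=1, workers=4)
```

Writing the file is a one-time, single-threaded pass over the data, but the documents are stored as plain lines rather than as pre-split token lists, which was the memory concern raised in the question.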

Answer 1
1 Accepted 2019-11-12 16:22:22

Answer 2
0 2020-07-17 23:38:48
