
Batch-train word2vec in gensim with support of multiple workers

Context

There exist several questions about how to train Word2Vec in gensim with streamed data. However, these questions don't address the issue that streaming cannot use multiple workers, since there is no array to split between threads.

Hence I wanted to create a generator providing such functionality for gensim. My result looks like this:

from gensim.models import Word2Vec as w2v
import numpy as np

#The data is stored in a python list, untokenized.
#It's too much data to store it tokenized, so I have to do the split while streaming.
data = ['this is document one', 'this is document two', ...]

#Now the generator class
import threading

class dataGenerator:
    """
    Generator for batch-tokenization.
    """

    def __init__(self, data: list, batch_size:int = 40):
        """Initialize generator and pass data."""

        self.data = data
        self.batch_size = batch_size
        self.lock = threading.Lock()


    def __len__(self):
        """Get total number of batches."""
        return int(np.ceil(len(self.data) / float(self.batch_size)))


    def __iter__(self) -> list:
        """
        Iterator-wrapper for generator-functionality (since generators cannot be used directly).
        Allows for data-streaming.
        """
        for idx in range(len(self)):
            yield self[idx]


    def __getitem__(self, idx):

        #Make multithreading thread-safe
        with self.lock:

            # Returns current batch by slicing data.
            return [arr.split(" ") for arr in self.data[idx * self.batch_size : (idx + 1) * self.batch_size]]


#And now do the training
model = w2v(
             sentences=dataGenerator(data),
             size=300,
             window=5,
             min_count=1,
             workers=4
            )

This results in the error

TypeError: unhashable type: 'list'

Since dataGenerator(data) would work if I just yielded a single split document, I assume that gensim's word2vec wraps the generator output within an extra list. In this case the __iter__ would look like:

def __iter__(self) -> list:
    """
    Iterator-wrapper for generator-functionality (since generators cannot be used directly).
    Allows for data-streaming.
    """
    for text in self.data:
        yield text.split(" ")

Hence, my batch would also be wrapped, resulting in something like [[['this', '...'], ['this', '...']], [[...], [...]]] (=> a list of lists of lists), which cannot be processed by gensim.




My question:

Can I "stream"-pass batches in order to use multiple workers? How can I change my code accordingly?
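One possible fix, sketched here as an assumption rather than taken from the original post: keep the batched __getitem__, but flatten each batch in __iter__ so gensim still receives exactly one tokenized sentence per iteration step. Because this is an iterable class (not a bare generator), it can also be restarted, which gensim needs for its multiple passes over the corpus:

```python
# Sketch: batched slicing internally, but __iter__ flattens the batches
# so each yielded item is a single sentence (a list of token strings).
from itertools import chain
import math

class BatchedCorpus:
    def __init__(self, data, batch_size=40):
        self.data = data
        self.batch_size = batch_size

    def __len__(self):
        # number of batches
        return math.ceil(len(self.data) / self.batch_size)

    def __getitem__(self, idx):
        # slice out one batch and tokenize it
        batch = self.data[idx * self.batch_size : (idx + 1) * self.batch_size]
        return [text.split(" ") for text in batch]

    def __iter__(self):
        # flatten the list-of-batches into a stream of sentences
        yield from chain.from_iterable(self[i] for i in range(len(self)))

corpus = BatchedCorpus(["this is document one", "this is document two"],
                       batch_size=1)
print(list(corpus))
# [['this', 'is', 'document', 'one'], ['this', 'is', 'document', 'two']]
```

Note that the batching here only amortizes the slicing; gensim's worker threads still pick up jobs of sentences from this single reader stream, so the reader itself remains one thread.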

It seems I was too impatient. I ran the streaming function written above, which processes only one document at a time instead of a batch:

def __iter__(self) -> list:
    """
    Iterator-wrapper for generator-functionality (since generators cannot be used directly).
    Allows for data-streaming.
    """
    for text in self.data:
        yield text.split(" ")

After starting the w2v function, it took about ten minutes until all cores were working correctly.

It seems that building the vocabulary does not support multiple cores, so only one was used for this task. Presumably, it took so long because of the corpus size. After gensim built the vocab, all cores were used for the training.

So if you are running into this issue as well, maybe some patience will already help :)

Just want to reiterate that @gojomo's comment is the way to go: with a large corpus and multiple CPUs, it's much faster to train gensim word2vec using the corpus_file parameter instead of sentences, as mentioned in the docs:

  • corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get performance boost. Only one of sentences or corpus_file arguments need to be passed (or none of them, in that case, the model is left uninitialized).

LineSentence format is basically just one sentence per line, with words space-separated; plain text, .bz2, or .gz.
