Embedding multiword ngram phrases with PathLineSentences in gensim word2vec

I have around 82 gzipped files (around 180MB each and 14GB total), each containing newline-separated sentences. I am thinking of using PathLineSentences from gensim to train a Word2Vec model on them, so that I do not have to load the whole list of sentences into RAM.

I would also like the embeddings to include multiword phrases. But from the documentation, it seems that I need a phrase detector that has already been trained on all the sentences I have, e.g.

from gensim.models import Phrases, Word2Vec
# Train a bigram detector.
bigram_transformer = Phrases(all_sentences)
# Apply the trained MWE detector to a corpus, using the result to train a Word2Vec model.
model = Word2Vec(bigram_transformer[all_sentences], min_count=1)

Now, I have two questions:

  1. Is there any way to do the phrase detection while running Word2Vec over each of the individual files in a streaming manner?
  2. If not, is there any way to do the initial phrase detection in a streaming manner, similar to how PathLineSentences works?

The Gensim Phrases class accepts data in exactly the same form as Word2Vec: a re-iterable sequence of tokenized texts. A streaming iterable such as PathLineSentences therefore works for it too.

You can supply that iterable twice: first as the training corpus for Phrases, then as the corpus to be transformed into paired bigrams.

However, I would strongly suggest that you not do the phrase combination in the same stream that feeds Word2Vec, for both clarity and efficiency reasons.

Instead, do the transformation once, writing the results to a new, single corpus file. Then:

  • you can easily review the results of the bigram combinations
  • the pair-by-pair calculations that decide which words get combined are done only once, producing a simple corpus of space-delimited tokens. (Otherwise, each of the epochs + 1 passes made by Word2Vec would need to repeat the same calculations.)

Roughly that'd look like:

# bigram_transformer and all_sentences are the objects from the question's snippet.
with open('corpus.txt', 'w') as of:
    for phrased_sentence in bigram_transformer[all_sentences]:
        of.write(' '.join(phrased_sentence))
        of.write('\n')

(You could instead write to a gzipped file like corpus.txt.gz, using GzipFile or smart_open's gzip functionality, if you'd like.)
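A gzipped variant could look roughly like the sketch below, which uses only the standard-library `gzip` module. The function name `write_phrased_corpus` is hypothetical, and `phrased_sentences` is assumed to be any iterable of token lists, such as `bigram_transformer[all_sentences]`.

```python
import gzip


def write_phrased_corpus(phrased_sentences, out_path):
    """Write one space-delimited sentence per line to a gzipped text file."""
    count = 0
    # "wt" opens the gzip stream in text mode, so plain str can be written.
    with gzip.open(out_path, "wt", encoding="utf-8") as out_file:
        for tokens in phrased_sentences:
            out_file.write(" ".join(tokens))
            out_file.write("\n")
            count += 1
    # Return the number of sentences written, handy as a sanity check.
    return count
```

Something like `write_phrased_corpus(bigram_transformer[all_sentences], 'corpus.txt.gz')` would then stand in for the plain-text loop.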

The new file then shows you the exact data Word2Vec is operating on, and can be fed in as a simple corpus: either wrapped as an iterable with LineSentence, or passed via the corpus_file option, which can make better use of more worker threads.
