I have around 82 gzipped files (around 180MB each, 14GB total) where each file contains newline-separated sentences. I am thinking of using PathLineSentences from gensim's Word2Vec module to train a word2vec model on the vocabulary. That way I do not have to load the entire list of sentences into RAM.
I also want the embeddings to include multiword phrases. But from the documentation, it seems that I need a phrase detector already trained on all the sentences I have, e.g.
from gensim.models import Phrases, Word2Vec

# Train a bigram detector.
bigram_transformer = Phrases(all_sentences)

# Apply the trained MWE detector to the corpus, using the result to train a Word2Vec model.
model = Word2Vec(bigram_transformer[all_sentences], min_count=1)
Now, I have two questions:
The gensim Phrases class will accept data in the exact same form as Word2Vec: an iterable of all the tokenized texts.
You can provide that same iterable first as the initial training corpus, then again as the corpus to be transformed into paired bigrams.
However, I would highly suggest that you not do the phrase-combination as a simultaneous stream feeding into Word2Vec, for both clarity and efficiency reasons.
Instead, do the transformation once, writing the results to a new, single corpus file. Then you can review exactly what the phrase-transformation produced, and the bigram calculations happen only once. (Otherwise, the epochs + 1 passes done by Word2Vec will need to repeat the same calculations.) Roughly that'd look like:
with open('corpus.txt', 'w') as of:
    for phrased_sentence in bigram_transformer[all_sentences]:
        of.write(' '.join(phrased_sentence))
        of.write('\n')
(You could instead write to a gzipped file like corpus.txt.gz, using GzipFile or smart_open's gzip functionality, if you'd like.)
Then the new file shows you the exact data Word2Vec is operating on, and it can be fed as a simple corpus: wrapped as an iterable with LineSentence, or even passed via the corpus_file option, which can make better use of multiple worker threads.