
Is the Gensim word2vec model the same as the standard model by Mikolov?

I am implementing a paper to compare our performance. In the paper, the author says:

300-dimensional pre-trained word2vec vectors (Mikolov et al., 2013)

I am wondering whether the pretrained word2vec Gensim model here is the same as the pretrained embeddings on the official Google site (the GoogleNews-vectors-negative300.bin.gz file).


My doubt arises from this line in the Gensim documentation (in the Word2Vec Demo section):

We will fetch the Word2Vec model trained on part of the Google News dataset, covering approximately 3 million words and phrases

Does this mean the model in Gensim is not fully trained? Is it different from the official embeddings by Mikolov?

That demo code for reading word-vectors downloads the exact same Google-trained GoogleNews-vectors-negative300 set of vectors. (No one else can re-train that dataset, because the original corpus of news articles used, over 100B words of training data from around 2013 if I recall correctly, is internal to Google.)
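
For illustration, here is a minimal sketch of loading those same vectors through Gensim's downloader API (the attribute names below assume Gensim 4.x; the first call downloads and caches roughly 1.6 GB):

import gensim.downloader as api

# Fetches the Google-trained GoogleNews-vectors-negative300 vectors,
# exposed in gensim-data under the name "word2vec-google-news-300".
wv = api.load("word2vec-google-news-300")

print(wv.vector_size)           # 300 dimensions
print(len(wv.index_to_key))     # roughly 3 million words and phrases
print(wv.most_similar("king", topn=3))

This returns a KeyedVectors object holding the frozen, pre-trained vectors; there is no further training involved, so it is not a partially trained model, just vectors trained on part of Google's internal news corpus.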

Algorithmically, the gensim Word2Vec implementation was closely modeled on the word2vec.c code released by Google/Mikolov, so it should match its results in measurable respects for any newly-trained vectors. (Slight differences in the threading approaches may cause slight differences in the resulting vectors.)
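
As a hedged sketch, training fresh vectors in Gensim with settings roughly mirroring the word2vec.c defaults might look like the following (the toy corpus and the choice of parameter values here are assumptions for illustration, not the configuration used for the GoogleNews vectors; parameter names assume Gensim 4.x):

from gensim.models import Word2Vec

# Placeholder corpus; substitute your own tokenized sentences.
sentences = [
    ["the", "quick", "brown", "fox"],
    ["jumps", "over", "the", "lazy", "dog"],
]

model = Word2Vec(
    sentences,
    vector_size=300,  # dimensionality, matching the 300-d GoogleNews vectors
    window=5,         # context window, as in word2vec.c's default
    min_count=1,      # word2vec.c defaults to 5; lowered here for the toy corpus
    sg=0,             # 0 = CBOW (word2vec.c default), 1 = skip-gram
    negative=5,       # negative sampling, as in word2vec.c
    workers=4,        # threading; the source of the minor non-determinism noted above
)

print(model.wv.most_similar("fox", topn=2))

With the same corpus and comparable hyperparameters, vectors trained this way should behave essentially like those from the original C tool, up to the small run-to-run variation that multithreaded training introduces.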
