
Is the Gensim word2vec model same as the standard model by Mikolov?

I am implementing a paper to compare our performance. In the paper, the author says

300-dimensional pre-trained word2vec vectors (Mikolov et al., 2013)

I am wondering whether the pretrained word2vec Gensim model here is the same as the pretrained embeddings on the official Google site (the GoogleNews-vectors-negative300.bin.gz file).


My doubt arises from this line in the Gensim documentation (in the Word2Vec Demo section):

We will fetch the Word2Vec model trained on part of the Google News dataset, covering approximately 3 million words and phrases

Does this mean the model on gensim is not fully trained? Is it different from the official embeddings by Mikolov?

That demo code for reading word-vectors is downloading the exact same Google-trained GoogleNews-vectors-negative300 set of vectors. (No one else can re-train that dataset, because the original corpus of news articles used, over 100B words of training data from around 2013 if I recall correctly, is internal to Google.)
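For reference, here is a minimal sketch of two equivalent ways to obtain those same vectors (assuming gensim 4.x; 'word2vec-google-news-300' is the dataset name used by the gensim downloader in that demo):

    # Option 1: let gensim's downloader fetch the Google-trained vectors
    # (the same GoogleNews-vectors-negative300 data, repackaged for gensim).
    import gensim.downloader as api

    kv = api.load('word2vec-google-news-300')   # large (~1.6 GB) download; returns KeyedVectors

    # Option 2: load the original .bin.gz file downloaded from Google's site.
    # from gensim.models import KeyedVectors
    # kv = KeyedVectors.load_word2vec_format(
    #     'GoogleNews-vectors-negative300.bin.gz', binary=True)

    # Either way, the vectors behave identically:
    print(kv.most_similar('king', topn=3))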

Algorithmically, the gensim Word2Vec implementation was closely modeled on the word2vec.c code released by Google/Mikolov, so for any newly-trained vectors its results should match in every measurable respect. (Slight differences in threading approaches may cause minor differences.)
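As an illustration only (not the setup used to produce the GoogleNews vectors), here is a hedged sketch of training new vectors with gensim's Word2Vec, using gensim 4.x parameter names; the toy corpus and the specific hyperparameter values are placeholders:

    from gensim.models import Word2Vec

    # Placeholder corpus: an iterable of tokenized sentences.
    my_corpus = [['hello', 'word2vec', 'world'], ['another', 'tokenized', 'sentence']]

    model = Word2Vec(
        sentences=my_corpus,
        vector_size=300,  # 300-dimensional vectors ('size' in gensim < 4.0)
        sg=1,             # skip-gram; sg=0 would select CBOW
        negative=5,       # negative sampling, as in word2vec.c
        window=5,
        min_count=1,      # word2vec.c defaults to 5; lowered here for the toy corpus
        workers=4,        # threading is where small run-to-run differences can arise
    )

    word_vectors = model.wv   # the trained KeyedVectors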
