
Load pre-trained word2vec model for doc2vec

I'm using gensim to extract a feature vector from a document. I've downloaded Google's pre-trained model, GoogleNews-vectors-negative300.bin, and loaded it with the following command:

model = models.Doc2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

My goal is to get a feature vector for a document. For a single word, it's very easy to get the corresponding vector:

vector = model[word]

However, I don't know how to do the same for a whole document. Could you please help?

A set of word vectors (such as GoogleNews-vectors-negative300.bin) is neither necessary nor sufficient for the kind of text vectors (Le/Mikolov 'Paragraph Vectors') created by the Doc2Vec class. Doc2Vec instead expects to be trained on example texts so that it learns a vector per document. Once trained, the model can also be used to 'infer' vectors for new documents.

(The Doc2Vec class only supports the load_word2vec_format() method because it inherits from the Word2Vec class, not because it needs that functionality.)

There's another, simpler kind of text vector that can be created by averaging the vectors of all the words in the document, perhaps weighted by some per-word significance. But that's not what Doc2Vec provides.
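A minimal sketch of that averaging approach, assuming NumPy, might look like the following. The helper name average_vector and the toy three-dimensional vectors are made up for the example. The small dict stands in for a loaded word-vector object; with gensim, the GoogleNews file can be loaded via KeyedVectors.load_word2vec_format(..., binary=True), and the resulting object supports the same `word in vectors` and `vectors[word]` lookups used here.

```python
import numpy as np

def average_vector(tokens, word_vectors, dim):
    """Return the mean of the vectors of all tokens found in word_vectors.

    Unknown tokens are skipped; if no token is known, a zero vector is returned.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)

# Toy stand-in for pre-trained word vectors (hypothetical values).
toy_vectors = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.0, 1.0, 0.0]),
}

# "unicorn" is not in the vocabulary, so only "cat" and "dog" are averaged.
doc_vector = average_vector(["cat", "dog", "unicorn"], toy_vectors, dim=3)
```

A significance weighting such as TF-IDF could be added by switching to a weighted mean, for instance with np.average and its weights argument.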

I tried this:

 model = models.Doc2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

and it gives me an error saying that Doc2Vec does not contain any word2vec format.
