How to load the pre-trained doc2vec model and use its vectors

Does anyone know which function I should use to load the pre-trained doc2vec models from https://github.com/jhlau/doc2vec ?

I know we can use KeyedVectors.load_word2vec_format() to load the word vectors from pre-trained word2vec models, but does gensim have a similar function for loading pre-trained doc2vec models?

Thanks a lot.

When a model like Doc2Vec is saved with gensim's native save(), it can be reloaded with the native load() method:

from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec.load(filename)  # filename: path to the main save file

Note that large internal arrays may have been saved alongside the main file, in other files whose names add extra extensions to the main filename, and all of those files must be kept together to reload a fully functional model. (You still need to specify only the main save file; the auxiliary files will be discovered at their expected names alongside it in the same directory.)
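Before moving or copying a saved model, it can help to check which files belong together. The following is only a small sketch using the standard library: the helper name list_saved_model_files is made up for illustration, and it assumes the auxiliary files are named as the main filename plus a further dot-separated suffix (for example a hypothetical doc2vec.bin.syn0.npy next to doc2vec.bin).

```python
from pathlib import Path


def list_saved_model_files(main_path):
    """Return the main save file followed by sibling files named '<main>.<extra>'.

    Assumption: auxiliary arrays live in the same directory and their names
    start with the main filename plus a dot. This helper is illustrative and
    is not part of gensim.
    """
    main = Path(main_path)
    prefix = main.name + "."
    siblings = sorted(
        p for p in main.parent.iterdir()
        if p.is_file() and p.name.startswith(prefix)
    )
    return [main] + siblings
```

Calling list_saved_model_files("model_folder/doc2vec.bin") would then list every file that should travel with the model, while Doc2Vec.load() itself is still given only the main path.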

You may have other issues trying to use those pre-trained models. In particular:

  • as noted in the linked page, the author used a custom variant of gensim that forked off about 2 years ago; the files might not load in standard gensim, or in later gensim versions

  • it's not completely clear what parameters were used to train those models (though I suppose that if you succeed in loading them, you could see the parameters as properties of the model), how much meta-optimization was done and for which purposes, or whether those purposes match your own project

  • if the parameters are as shown in one of the repo files, train_model.py, some are inconsistent with best practices (min_count=1 is usually bad for Doc2Vec) or with the apparent model size (a mere 1.4GB model couldn't hold 300-dimensional vectors for all of the millions of documents or word-tokens in 2015 Wikipedia)
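The size argument in the last point can be checked with rough arithmetic. This is a back-of-the-envelope sketch, assuming vectors are stored as 4-byte float32 values and taking 1 GB as 10**9 bytes; the figure of 4 million documents is only a placeholder for "millions", not a measured count, and the function names are made up for illustration.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as 32-bit floats


def vector_storage_bytes(num_vectors, dims=300):
    """Bytes needed to store num_vectors dense vectors of the given dimensionality."""
    return num_vectors * dims * BYTES_PER_FLOAT32


def max_vectors_in(budget_bytes, dims=300):
    """Largest number of dense vectors that fits in budget_bytes."""
    return budget_bytes // (dims * BYTES_PER_FLOAT32)


model_size = 1_400_000_000  # 1.4 GB, taking 1 GB = 10**9 bytes

# Upper bound on 300-dimensional vectors that a 1.4 GB file could hold
print(max_vectors_in(model_size))              # 1166666

# Storage for a hypothetical 4 million document vectors, in GB
print(vector_storage_bytes(4_000_000) / 1e9)   # 4.8
```

So even if the file contained nothing but vectors, it would hold a little over a million of them, which is fewer than "millions" of document vectors before any word vectors are counted.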

I would highly recommend training your own model, on a corpus you understand, with recent code, and using metaparameters optimized for your own purposes.

Try this:

import gensim.models as g

# Path to the downloaded pre-trained doc2vec model
model_path = "model_folder/doc2vec.bin"

# Load the model
m = g.Doc2Vec.load(model_path)
