How to use Gensim doc2vec with pre-trained word vectors?

I recently came across the doc2vec addition to Gensim. How can I use pre-trained word vectors (e.g. those found on the original word2vec website) with doc2vec?

Or is doc2vec getting the word vectors from the same sentences it uses for paragraph-vector training?

Thanks.

Note that the "DBOW" (dm=0) training mode doesn't require, or even create, word-vectors as part of the training. It merely learns document vectors that are good at predicting each word in turn (much like the word2vec skip-gram training mode).

(Before gensim 0.12.0, there was the parameter train_words mentioned in another comment, which some documentation suggested would co-train words. However, I don't believe this ever actually worked. Starting in gensim 0.12.0, there is the parameter dbow_words, which skip-gram trains words simultaneously with the DBOW doc-vectors. Note that this makes training take longer, by a factor related to window. So if you don't need word-vectors, you may still leave this off.)
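
A minimal sketch of the two DBOW variants, assuming the gensim-0.12-era API (here sentences is a placeholder for an iterable of tagged documents, as in the longer example further down):

from gensim.models import Doc2Vec

#Pure DBOW: learns doc-vectors only; word-vectors are left untrained
model_dbow = Doc2Vec(sentences, size = 100, window = 5, min_count = 2, dm = 0)

#DBOW plus interleaved skip-gram word training: slower by a factor
#related to window, but also yields trained word-vectors
model_dbow_words = Doc2Vec(sentences, size = 100, window = 5, min_count = 2, dm = 0, dbow_words = 1)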

In the "DM" training method ( dm=1 ), word-vectors are inherently trained during the process along with doc-vectors, and are likely to also affect the quality of the doc-vectors.在“DM”训练方法( dm=1 )中,word-vectors 在这个过程中与 doc-vectors 一起被固有地训练,并且很可能也会影响 doc-vectors 的质量。 It's theoretically possible to pre-initialize the word-vectors from prior data.理论上可以从先前的数据中预初始化词向量。 But I don't know any strong theoretical or experimental reason to be confident this would improve the doc-vectors.但我不知道有任何强有力的理论或实验理由来相信这会改善文档向量。

One fragmentary experiment I ran along these lines suggested the doc-vector training got off to a faster start, with better predictive qualities after the first few passes, but this advantage faded with more passes. Whether you hold the word-vectors constant or let them continue to adjust with the new training is also likely an important consideration... but which choice is better may depend on your goals, data set, and the quality/relevance of the pre-existing word-vectors.

(You could repeat my experiment with the intersect_word2vec_format() method available in gensim 0.12.0, and try different levels of making the pre-loaded vectors resistant to new training via the syn0_lockf values. But remember this is experimental territory: the basic doc2vec results don't rely on, or even necessarily improve with, reused word vectors.)
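
A minimal sketch of that experimental setup, assuming the gensim-0.12-era API and a placeholder file "pretrained_vectors.txt" in the C word2vec text format:

from gensim.models import Doc2Vec

model = Doc2Vec(size = 100, window = 5, min_count = 2, dm = 1)
model.build_vocab(sentences)
#Overwrite vectors for in-vocabulary words with the pre-trained values
model.intersect_word2vec_format("pretrained_vectors.txt")
#Per-word lock factors: 0.0 freezes a word-vector against further
#training updates, 1.0 allows full adjustment
model.syn0_lockf[:] = 0.0
model.train(sentences)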

Well, I am recently using Doc2Vec too. I was thinking of using an LDA result as the word vectors and fixing those word vectors to get a document vector. The result isn't very interesting, though; maybe it's just that my data set isn't that good. The code is below. Doc2Vec saves word vectors and document vectors together in the array doc2vecmodel.syn0, and you can directly change the vector values. The only problem may be that you need to find out which position in syn0 represents which word or document, since the vectors are stored in random order in syn0.

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim import corpora, models, similarities
import gensim
from sklearn import svm, metrics
import numpy

#Read in texts into div_texts (for LDA and Doc2Vec)
div_texts = []
with open("clean_ad_nonad.txt") as f:
    for line in f:
        div_texts.append(line.strip().split(" "))

#Set up dictionary and MMcorpus
dictionary = corpora.Dictionary(div_texts)
dictionary.save("ad_nonad_lda_deeplearning.dict")
#dictionary = corpora.Dictionary.load("ad_nonad_lda_deeplearning.dict")
print dictionary.token2id["junk"]
corpus = [dictionary.doc2bow(text) for text in div_texts]
corpora.MmCorpus.serialize("ad_nonad_lda_deeplearning.mm", corpus)

#LDA training
id2token = {}
token2id = dictionary.token2id
for onemap in dictionary.token2id:
    id2token[token2id[onemap]] = onemap
#ldamodel = models.LdaModel(corpus, num_topics = 100, passes = 1000, id2word = id2token)
#ldamodel.save("ldamodel1000pass.lda")
#ldamodel = models.LdaModel(corpus, num_topics = 100, id2word = id2token)
ldamodel = models.LdaModel.load("ldamodel1000pass.lda")
ldatopics = ldamodel.show_topics(num_topics = 100, num_words = len(dictionary), formatted = False)
print ldatopics[10][1]
print ldatopics[10][1][1]
#Build one word-to-probability lookup per topic: show_topics orders
#words by probability within each topic, so positions in the tuple
#lists are not comparable across topics
ldawordprob = []
for topic in ldatopics:
    ldawordprob.append(dict((word, prob) for (prob, word) in topic))

#Doc2Vec initialize
sentences = []
for i in range(len(div_texts)):
    string = "SENT_" + str(i)
    sentence = models.doc2vec.LabeledSentence(div_texts[i], labels = [string])
    sentences.append(sentence)
doc2vecmodel = models.Doc2Vec(sentences, size = 100, window = 5, min_count = 0, dm = 1)
print "Initial word vector for word junk:"
print doc2vecmodel["junk"]

#Replace the word vector with word vectors from LDA
print len(doc2vecmodel.syn0)
index2wordcollection = doc2vecmodel.index2word
print index2wordcollection
for i in range(len(doc2vecmodel.syn0)):
    if index2wordcollection[i].startswith("SENT_"):
        continue
    word = index2wordcollection[i]
    wordvectorfromlda = [ldawordprob[j][word] for j in range(100)]
    doc2vecmodel.syn0[i] = wordvectorfromlda
#print doc2vecmodel.index2word[26841]
#doc2vecmodel.syn0[0] = [0 for i in range(100)]
print "Changed word vector for word junk:"
print doc2vecmodel["junk"]

#Train Doc2Vec
#Freeze word-vectors so that further passes only update doc-vectors
doc2vecmodel.train_words = False
print "Initial doc vector for 1st document"
print doc2vecmodel["SENT_0"]
for i in range(50):
    print "Round: " + str(i)
    doc2vecmodel.train(sentences)
print "Trained doc vector for 1st document"
print doc2vecmodel["SENT_0"]

#Using SVM to do classification
#Collect the learned doc-vectors for all 4143 documents
resultlist = []
for i in range(4143):
    string = "SENT_" + str(i)
    resultlist.append(doc2vecmodel[string])
#Train on documents 0-999 (class 0) and 2210-3209 (class 1)
svm_x_train = []
for i in range(1000):
    svm_x_train.append(resultlist[i])
for i in range(2210,3210):
    svm_x_train.append(resultlist[i])
print len(svm_x_train)

#Test on documents 1000-2209 (class 0) and 3210-4142 (class 1)
svm_x_test = []
for i in range(1000,2210):
    svm_x_test.append(resultlist[i])
for i in range(3210,4143):
    svm_x_test.append(resultlist[i])
print len(svm_x_test)

svm_y_train = numpy.array([0 for i in range(2000)])
for i in range(1000,2000):
    svm_y_train[i] = 1
print svm_y_train

svm_y_test = numpy.array([0 for i in range(2143)])
for i in range(1210,2143):
    svm_y_test[i] = 1
print svm_y_test


svc = svm.SVC(kernel='linear')
svc.fit(svm_x_train, svm_y_train)

expected = svm_y_test
predicted = svc.predict(svm_x_test)

print("Classification report for classifier %s:\n%s\n"
      % (svc, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))

print doc2vecmodel["junk"]

This forked version of gensim allows loading pre-trained word vectors for training doc2vec. Here you have an example of how to use it. The word vectors must be in the C word2vec tool's text format: one line per word vector, where first comes a string representing the word, followed by space-separated float values, one for each dimension of the embedding.
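
For illustration, a toy file in that format could be written like this (made-up words and values; the file name is a placeholder):

#Toy illustration of the expected text format: the word string first,
#then space-separated floats, one per embedding dimension
with open("toy_vectors.txt", "w") as f:
    f.write("the 0.418 0.24968 -0.41242 0.1217\n")
    f.write("of 0.70853 0.57088 -0.4716 0.18048\n")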

This work belongs to a paper which claims that using pre-trained word embeddings actually helps building the document vectors. However, I am getting almost the same results whether or not I load the pre-trained embeddings.

Edit: actually, there is one remarkable difference in my experiments. When I loaded the pre-trained embeddings, I only needed to train doc2vec for half as many iterations to get almost the same results (training longer than that produced worse results in my task).

Radim just posted a tutorial on the doc2vec features of gensim (yesterday, I believe - your question is timely!).

Gensim supports loading pre-trained vectors from the C implementation, as described in the gensim models.word2vec API documentation.
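
A minimal sketch of that loading step with the gensim-0.12-era API (file names are placeholders):

from gensim.models import Word2Vec

#Text format produced by the original C tool with -binary 0
model = Word2Vec.load_word2vec_format("vectors.txt", binary = False)
#For the binary format produced with -binary 1:
#model = Word2Vec.load_word2vec_format("vectors.bin", binary = True)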
