How to use doc2vec with phrases?

I want to use phrases with doc2vec, and I am using gensim.phrases. Doc2vec needs tagged documents to train the model, and I cannot tag the phrases. How can I do this?

Here is my code:

text = phrases.Phrases(text)
for i in range(len(text)):
    string1 = "SENT_" + str(i)

    sentence = doc2vec.LabeledSentence(tags=string1, words=text[i])
    text[i]=sentence

print "Training model..."
model = Doc2Vec(text, workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)

Calling Phrases() trains a phrase-creating model. You later apply that model to text to get back phrase-combined text.

Don't replace your original text with the trained model, as your code's first line does. Also, don't try to assign into the Phrases model, as happens in your current loop, and don't index the Phrases model by integers.

The gensim docs for the Phrases class have examples of its proper use; if you follow that pattern you'll do well.

Further, note that LabeledSentence has been replaced by TaggedDocument, and its tags argument should be a list of tags. If you provide a string, it will be treated as a list of one-character tags (instead of the one tag you intend).
