
Find similarity with doc2vec like word2vec

Is there a way to find similar docs like we do in word2vec?

Like:

  model2.most_similar(positive=['good','nice','best'],
    negative=['bad','poor'],
    topn=10)

I know we can use infer_vector and feed the inferred vectors in to find similar ones, but I want to supply many positive and negative examples, as we do in word2vec.

Is there any way we can do that? Thanks!

The doc-vectors part of a Doc2Vec model works just like word-vectors with respect to a most_similar() call. You can supply multiple doc-tags or full vectors inside both the positive and negative parameters.

So you could call...

sims = d2v_model.docvecs.most_similar(positive=['doc001', 'doc009'], negative=['doc102'])

...and it should work. The elements of the positive or negative lists can be doc-tags that were present during training, or raw vectors (like those returned by infer_vector(), or your own averages of multiple such vectors).

I don't believe there is a pre-written function for this.

One approach would be to write a function that iterates through each word in the positive list to get the top n words for that particular word.

So for the positive words in your question example, you would end up with 3 lists of 10 words.

You could then identify words that are common across the 3 lists as the top n similar to your positive list. Since not all words will be common across the 3 lists, you probably need to get the top 20 similar words when iterating, so that you still end up with the top 10 words you want in your example.

Then do the same for the negative words.
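The steps above could be sketched as the following helper; the function name, the per-word pool size, and the ranking heuristic (count of lists a candidate appears in, then summed similarity) are my own choices, not a gensim API. It only assumes the model exposes a most_similar(positive=[word], topn=n) call returning (word, similarity) pairs, as gensim's KeyedVectors does.

```python
from collections import defaultdict

def common_most_similar(model, positive, topn=10, per_word=20):
    """Hypothetical helper: query most_similar() for each positive word
    separately, then rank candidates that recur across the per-word lists."""
    counts = defaultdict(int)    # number of per-word lists containing a candidate
    scores = defaultdict(float)  # summed similarity across those lists
    for word in positive:
        for candidate, sim in model.most_similar(positive=[word], topn=per_word):
            counts[candidate] += 1
            scores[candidate] += sim
    # Rank by how many lists share the candidate, breaking ties by similarity.
    ranked = sorted(counts, key=lambda w: (counts[w], scores[w]), reverse=True)
    return [(w, scores[w] / counts[w]) for w in ranked[:topn]]
```

The same helper run over the negative words gives a list of candidates to filter out or down-weight.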
