Pairwise Earth Mover Distance across all documents (word2vec representations)

Is there a library that will take a list of documents and compute the n×n matrix of distances en masse, where the word2vec model is supplied? I can see that gensim allows you to do this between two documents, but I need a fast comparison across all docs, like sklearn's cosine_similarity.

The "Word Mover's Distance" (earth mover's distance applied to groups of word vectors) is a fairly involved optimization calculation that depends on every word in each document.

I'm not aware of any tricks that would make it faster when calculating many distances at once, even many distances to the same document.

So the only thing needed to calculate pairwise distances is a pair of nested loops that considers each unique, order-ignoring pairing.

For example, assuming your list of documents (each a list of words) is docs, a gensim word-vector model is in model, and numpy is imported as np, you could calculate the array of pairwise distances D with:

D = np.zeros((len(docs), len(docs)))
for i in range(len(docs)):
    for j in range(len(docs)):
        if i == j:
            continue  # self-distance is 0.0
        if i > j:
            D[i, j] = D[j, i]  # re-use earlier calc
            continue  # skip the expensive recalculation
        D[i, j] = model.wmdistance(docs[i], docs[j])

It may take a while, but you'll then have all pairwise distances in the array D.
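The same loop can be written so that each unordered pair is visited exactly once, which makes the "compute once, mirror the result" idea explicit. What follows is only a minimal sketch: the helper name pairwise_distance_matrix and the toy jaccard_distance metric are made up for illustration, so that the snippet runs without a trained model. The toy metric is not Word Mover's Distance; it only stands in for any symmetric distance function.

```python
from itertools import combinations


def pairwise_distance_matrix(docs, distance):
    """Return an n x n nested list of distances between all documents.

    `distance` is any symmetric callable taking two documents; it is
    called exactly once per unordered pair.
    """
    n = len(docs)
    D = [[0.0] * n for _ in range(n)]  # diagonal (self-distance) stays 0.0
    for i, j in combinations(range(n), 2):  # every pair with i < j
        D[i][j] = D[j][i] = distance(docs[i], docs[j])
    return D


def jaccard_distance(a, b):
    """Toy stand-in metric (NOT Word Mover's Distance).

    Computes 1 - (shared words / all distinct words) for two word lists.
    """
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Hypothetical sample documents, each a list of words.
docs = [
    ["obama", "speaks", "media", "illinois"],
    ["president", "greets", "press", "chicago"],
    ["obama", "greets", "press", "illinois"],
]
D = pairwise_distance_matrix(docs, jaccard_distance)
```

With a gensim word-vector model in model, as assumed above, passing model.wmdistance as the distance argument should give the Word Mover's Distance matrix, and np.array(D) converts the nested list to a numpy array. For n documents this makes n(n-1)/2 distance calls, so the total running time still grows quadratically.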

On top of the accepted answer, you may want to use the faster WMD library wmd-relax.

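The example could then be adjusted to the following (this presumes each entry of docs is a spaCy document whose similarity method is backed by wmd-relax, rather than a plain list of words):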

D[i, j] = docs[i].similarity(docs[j])
