
How to efficiently calculate/estimate cosine similarity for billions of pairs in a non-sparse matrix?

Asked 2020-07-22 16:16:26 · Tags: scala, apache-spark, hadoop, cosine-similarity

Suppose I have 10 million items, each represented by a 100-dimensional vector of real numbers (they are actually word2vec embeddings). For each item I want to get (approximately) the top 200 most similar items, using cosine similarity. My current standard cosine similarity implementation, written as a UDF in Hadoop (Hive), takes about 1 s to compute the similarity of one item against the 10 million others. At that rate the whole matrix would take about 10 million seconds (roughly 116 days) of sequential compute, which makes it infeasible. My next move is to run it on Spark with more parallelization, but that still won't solve the problem completely.

I know there are methods that reduce the amount of calculation for a sparse matrix, but my matrix is NOT sparse.

How can I efficiently get the most similar items for each item? Is there an approximation of cosine similarity that is more efficient to calculate?

1 answer

You can compress the vectors so that the score calculation becomes cheaper, for example by replacing cosine similarity with a simpler distance such as Hamming distance on binary codes.

The keyword to search for is vector quantization; many algorithms address vector compression.

Here is an example of a compression scheme whose scores are comparable to cosine similarity:

https://github.com/tdebatty/java-LSH/blob/master/src/main/java/info/debatty/java/lsh/SuperBit.java#L208
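To make the idea concrete, here is a minimal sketch of one such compression scheme: sign random projection, where each vector is reduced to a bit signature and the Hamming distance between signatures gives an estimate of the cosine similarity. This is not the SuperBit code linked above (SuperBit additionally orthogonalizes its random vectors in batches); it uses plain Gaussian hyperplanes. All function names and the choice of signature length are illustrative assumptions, and pure Python is used for readability only; it would be far too slow for 10 million vectors.

```python
import math
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes (normal vectors) of size dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """Pack one bit per hyperplane into an int: 1 if vec lies on its positive side."""
    sig = 0
    for i, plane in enumerate(hyperplanes):
        if sum(p * x for p, x in zip(plane, vec)) >= 0.0:
            sig |= 1 << i
    return sig


def hamming(sig_a, sig_b):
    """Number of bit positions in which the two signatures differ."""
    return bin(sig_a ^ sig_b).count("1")


def estimated_cosine(sig_a, sig_b, n_bits):
    """Each bit differs with probability angle/pi, so angle ~ pi * hamming / n_bits."""
    return math.cos(math.pi * hamming(sig_a, sig_b) / n_bits)


def exact_cosine(u, v):
    """Reference implementation, used only to check the estimate."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

In use, the signatures would be computed once per item (one pass over the data), after which comparing two items is an XOR plus a bit count rather than a 100-dimensional dot product. The estimate gets tighter as `n_bits` grows, so a reasonable design is to rank candidates by Hamming distance and re-score only a short list with the exact cosine similarity.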