
Comparing Large Number of Binary Strings

Original post: 2018-09-28 16:00:53 · Tags: python, elasticsearch, solr, redis, memcached

All,

Writing to see if anyone has any input on what they feel the best tech would be for the following scenario. Be it python, solr, redis, memcache, etc.

The situation is as follows.

I have 100 million+ binary strings, each around 1100 characters long... '0010100010101001010101011....'

What in your opinion would be the most logical way to do the following?

For a given string of the same number of characters, what would be the most efficient way to find the closest match? By closest, I mean sharing the greatest number of 0's and 1's at the same positions. Hamming distance, I believe.
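To make the metric concrete, here is a minimal sketch of Hamming distance on equal-length bit strings, plus a brute-force closest-match lookup. The names `hamming_distance` and `closest_match` and the toy strings are illustrative assumptions, not part of any library, and a linear scan like this is only meant to pin down the definition, not to scale to 100 million strings.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    # XOR of the integer forms has a 1 bit wherever the strings disagree.
    return bin(int(a, 2) ^ int(b, 2)).count("1")


def closest_match(query: str, pool: list[str]) -> str:
    """Return the pool entry with the smallest Hamming distance to query."""
    return min(pool, key=lambda s: hamming_distance(query, s))


pool = ["0010", "0111", "1010"]      # toy stand-in for the real pool
best = closest_match("0110", pool)   # first entry at minimal distance
```

Minimizing Hamming distance is the same as maximizing the number of positions where the two strings agree, which is the "closest" described above.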

My use case would actually involve taking 100k or so strings and trying to find their best match in the pool of 100 million+ strings.

Any thoughts? No particular tech has to be used, just preferably something that is fairly common.

Curious to see what ideas anyone may have.

Thanks, Tbone

2 Answers

You could use numpy, R, or MATLAB, or anything else that works with large matrices for this:

Say you have an NxM matrix A, where N is len(string) and M is the number of strings. And say you have a string S you're trying to match. You could:

1. Subtract the array version of S from each column of A.
2. Take the absolute value of all the elements of the result of (1).
3. Sum the result of (2) along the axis of N.
4. Argsort the result of (3) to find the indexes of the strings that have the lowest distance to S.
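The four steps above can be sketched in numpy roughly as follows. This is only an illustration on a toy pool: `bits_to_array` and `nearest_indices` are hypothetical helper names, and the sketch assumes each string is expanded to one signed byte per bit. At the scale in the question (100 million strings of about 1100 bits) such a matrix would take roughly 110 GB, so a real run would presumably need chunking or a packed-bit representation.

```python
import numpy as np


def bits_to_array(s: str) -> np.ndarray:
    """Convert a '0'/'1' string to a 1-D array of 0s and 1s."""
    # Signed int8 so that 0 - 1 yields -1 instead of wrapping around.
    return np.array([int(c) for c in s], dtype=np.int8)


def nearest_indices(A: np.ndarray, S: np.ndarray) -> np.ndarray:
    """Return column indexes of A ordered by Hamming distance to S."""
    diff = A - S[:, None]              # (1) subtract S from every column
    abs_diff = np.abs(diff)            # (2) 1 where the bits differ, else 0
    distances = abs_diff.sum(axis=0)   # (3) sum along the N (bit) axis
    return np.argsort(distances, kind="stable")  # (4) closest first


strings = ["0010", "0111", "1010"]     # toy pool, M = 3 strings of N = 4 bits
A = np.stack([bits_to_array(s) for s in strings], axis=1)  # shape (N, M)
S = bits_to_array("0111")

order = nearest_indices(A, S)
closest = strings[order[0]]
```

Because the entries are only 0 and 1, the sum of absolute differences in step (3) equals the Hamming distance between S and each column.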

You are basically trying to conduct nearest neighbor search in Hamming space on Elasticsearch.

Regarding this, a recently proposed FENSHSES method from [1] seems to be the state-of-the-art one on Elasticsearch.

[1] Mu, C., Zhao, J., Yang, G., Yang, B. and Yan, Z., 2019, October. Fast and Exact Nearest Neighbor Search in Hamming Space on Full-Text Search Engines. In International Conference on Similarity Search and Applications (pp. 49-56). Springer, Cham.