
Comparing string distance based on precomputed hashes

I have a large list (over 200,000) of strings that I'd like to compare to a given string. The given string is entered by a user, so it may be slightly incorrect.

What I was hoping to do was create some kind of precomputed hash for each string as it is added to the list. This hash would contain information such as the string length, the sum of all the characters, and so on.
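A minimal sketch of what such a precomputed signature could look like (Python is assumed here; the names `signature`, `lower_bound`, and `find_matches` and the distance cutoff are illustrative, not from any particular library):

```python
from collections import Counter

def signature(s):
    # Precompute once, when the string is added to the list.
    return (len(s), Counter(s))

def lower_bound(sig_a, sig_b):
    # Cheap lower bound on Levenshtein distance: both the length
    # difference and the "bag distance" (characters one string has
    # that the other lacks) can never exceed the true edit distance.
    len_a, bag_a = sig_a
    len_b, bag_b = sig_b
    extra_a = sum((bag_a - bag_b).values())
    extra_b = sum((bag_b - bag_a).values())
    return max(abs(len_a - len_b), extra_a, extra_b)

def levenshtein(a, b):
    # Plain dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[-1] + 1,                   # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def find_matches(query, precomputed, max_dist=2):
    # Run the expensive comparison only on strings whose cheap
    # signature says a close match is still possible.
    q_sig = signature(query)
    return [s for s, sig in precomputed
            if lower_bound(q_sig, sig) <= max_dist
            and levenshtein(query, s) <= max_dist]
```

With the signatures precomputed, most of the 200,000 strings can be rejected by a few integer comparisons, and the quadratic Levenshtein computation only runs on the survivors.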

My question is: does something like this already exist? Surely there is something that would let me avoid running Levenshtein distance against every string in the list?

Or maybe there's a third option I haven't thought of yet?

Sounds like you want to use a fuzzy hash of some sort. There are lots of hash functions available that can do things like this. The classic old "SOUNDEX" algorithm might even work.
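As a sketch, the classic American Soundex rules (keep the first letter, map consonants to digit classes, collapse adjacent duplicates, then pad or truncate to four characters) fit in a few lines — this is a simplified version that assumes non-empty ASCII input:

```python
def soundex(word):
    # Map consonants to their Soundex digit classes.
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate duplicate codes
            prev = code
    return (result + "000")[:4]  # pad or truncate to 4 characters
```

Strings with the same Soundex code can be stored together, so a misspelled query still lands in the right bucket as long as it sounds like the intended word.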

Another thought - if you estimate that the probability of an incorrect entry is low, then you might actually be fine having a direct hit 99.9% of the time, falling back to SOUNDEX, which might catch 90% of the remaining cases, and then searching the whole list the remaining 0.01% of the time.
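That tiered strategy might be sketched like this (Python assumed; the crude consonant-skeleton `phonetic_key` below is only a stand-in for a real SOUNDEX implementation, and `difflib` stands in for a proper Levenshtein scan):

```python
import difflib

def phonetic_key(s):
    # Placeholder phonetic hash: first letter plus consonant skeleton.
    s = s.lower()
    return s[:1] + "".join(c for c in s[1:] if c not in "aeiou")

def build_index(strings):
    # Tier 1: exact lookup. Tier 2: buckets keyed by the phonetic hash.
    exact = set(strings)
    buckets = {}
    for s in strings:
        buckets.setdefault(phonetic_key(s), []).append(s)
    return exact, buckets

def lookup(query, exact, buckets, all_strings):
    if query in exact:                        # the common case: a direct hit
        return query
    bucket = buckets.get(phonetic_key(query), [])
    near = difflib.get_close_matches(query, bucket, n=1)
    if near:                                  # typo caught by the phonetic bucket
        return near[0]
    # Last resort: fuzzy-scan the whole list.
    near = difflib.get_close_matches(query, all_strings, n=1, cutoff=0.5)
    return near[0] if near else None
```

Each tier is much cheaper than the one below it, so the expensive full scan only runs for the small fraction of queries the first two tiers miss.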

Also worth checking this discussion: How to find best fuzzy match for a string in a large string database
