Similar Data Algorithm

Question

I have a couple of DBs of user information, each with 10k-20k entries, each drawn from a couple of different sources, and each constantly growing. I'm looking to create a tool that can, within a certain tolerance, notice similar emails or similar names (first + ' ' + last). I'm running a MySQL database and can work with either C++ or PHP to run the comparison. Can anyone suggest any existing solutions or tutorials that would allow me to run a check against the database or an array of data and return possible duplicates? I'd just want it to pick up a few common mistakes like these:

josh@test.com <> josh@test.test.com <> jash@test.com
Josh O <> josh t O <> Joshua O

Maybe have the tolerance adjustable to within a certain number of characters of difference between the entries? Thank you very, very much for any advice or solutions; I've not had much success Googling.

Answer 1

I have some great news for you, and some horrible news for you.

The great news is that PHP has implementations of a few string comparison algorithms built right in: levenshtein() and similar_text().

It also has two relatively popular ways to break down English-ish words into simpler representations that are suitable for comparison: soundex() and metaphone().

While that's great news, the horrible news is that with 10-20k entries, you're going to need to perform somewhere close to one and a half metric ass-tons of comparisons if you use the first two options, and they aren't great performers. I'm not too sure what that would be in big-O notation, but I think it's somewhere in the range of O(run away).

Pre-calculating a similarity breakdown using the latter two functions and then running some variety of grouping operation on the resulting data might prove to be a major performance and time win.
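As a rough illustration of that pre-calculate-then-group idea, here is a minimal Python sketch. It assumes the classic American Soundex rules (the same family of algorithm behind PHP's soundex()), re-implemented here only so the example is self-contained. The helper names `name_key` and `group_by_key`, and the choice to key on the first and last word of a name, are assumptions made for illustration and not something from the answer.

```python
from collections import defaultdict

# Letter-to-digit table of classic American Soundex.
_CODES = {}
for _digit, _letters in (("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                         ("4", "L"), ("5", "MN"), ("6", "R")):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(word):
    """Return the 4-character Soundex code of `word` ('' if it has no A-Z letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    last = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != last:
            code += digit
        if c not in "HW":  # H and W do not separate letters with the same code
            last = digit
    return (code + "000")[:4]


def name_key(full_name):
    """Hypothetical grouping key: Soundex of the first word plus Soundex of the last word."""
    parts = full_name.split()
    if not parts:
        return ""
    return soundex(parts[0]) + " " + soundex(parts[-1])


def group_by_key(names):
    """Return the groups of names that share a key, i.e. the candidate duplicates."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(group_by_key(["Josh O", "josh t O", "Joshua O", "Mary Smith"]))
```

Each key is computed once per record, so the grouping pass is linear in the number of records. In MySQL the same idea could be approached by storing the key in an indexed column and grouping on it, though that is a design suggestion rather than something tested here.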

Answer 2

That would depend on your notion of "similarity". If you are looking for the number of characters that must be inserted, deleted or replaced in order to transform one string into another, the measure is called Levenshtein distance. Be warned, though, that it is quite slow for long strings (each comparison uses a number of operations proportional to mn, where m and n are the lengths of the strings being compared), but if your data is email addresses and other short strings, you should be fine (and your biggest problem would be the number of comparisons, as you would need to compare each pair of strings to each other).
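For reference, a minimal Python sketch of the textbook dynamic-programming computation of Levenshtein distance follows. It keeps only two rows of the table, so the time is proportional to mn as described above while the memory is proportional to the shorter string. The function name is just an illustrative choice.

```python
def levenshtein(a, b):
    """Return the number of single-character insertions, deletions, or
    substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # iterate rows over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("josh@test.com", "jash@test.com"))       # 1
    print(levenshtein("josh@test.com", "josh@test.test.com"))  # 5
```

With the examples from the question, a tolerance of 1 would catch the jash/josh typo, while the doubled "test." needs a tolerance of 5.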

Answer 3

Given a maximum character distance, this sounds like a job for the bitap algorithm (Wu and Manber, "Fast Searching with Text Errors"). It's the core algorithm of the agrep program and it can be quite fast when the number of acceptable character errors is limited. Google's implementation in library form for several languages can be found here. (The code for just doing the approximate match is relatively short and well documented.)
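To give a feel for how bitap works, here is a minimal Python sketch of the bit-parallel approximate search in the style of Wu and Manber. It is not the Google library mentioned above, and the function name `bitap_search` is made up for this example. Note that bitap looks for an approximate occurrence of the pattern anywhere inside the text, so it answers "does something close to this pattern appear in this string?" rather than computing a whole-string distance.

```python
def bitap_search(text, pattern, max_errors):
    """Return the index in `text` of the last character of the first
    approximate occurrence of `pattern` (at most `max_errors` insertions,
    deletions, or substitutions), or -1 if there is none."""
    m = len(pattern)
    if m == 0 or not 0 <= max_errors < m:
        raise ValueError("need a non-empty pattern and 0 <= max_errors < len(pattern)")
    full = (1 << m) - 1      # keep only m bits of state
    accept = 1 << (m - 1)    # bit set when the whole pattern has matched
    masks = {}               # masks[ch] has bit i set where pattern[i] == ch
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    # state[d] bit i set: pattern[:i+1] matches text ending here with <= d errors.
    state = [(1 << d) - 1 for d in range(max_errors + 1)]
    for pos, ch in enumerate(text):
        char_mask = masks.get(ch, 0)
        prev_old = state[0]  # value of state[d-1] before this character
        state[0] = ((state[0] << 1) | 1) & char_mask
        for d in range(1, max_errors + 1):
            old = state[d]
            state[d] = (
                (((old << 1) | 1) & char_mask)  # exact match of this character
                | prev_old                      # extra character in the text
                | (prev_old << 1)               # substituted character
                | (state[d - 1] << 1)           # pattern character missing from text
                | 1                             # first pattern char counted as an error
            ) & full
            prev_old = old
        if state[max_errors] & accept:
            return pos
    return -1


if __name__ == "__main__":
    print(bitap_search("jash@test.com", "josh", 1))  # 3
    print(bitap_search("jash@test.com", "josh", 0))  # -1
```

Python integers are unbounded, so the pattern length is not limited to a machine word here; a C++ version would typically restrict patterns to 32 or 64 characters to keep each state in one register.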

You're still looking at O(n²) for the total number of e-mail to e-mail comparisons (n² is ~400M for 20k e-mails, or about 200M distinct pairs). But a well-tuned implementation of a good comparison function like bitap should help to reduce the constant. You can probably also cull a bunch of comparisons by dividing the e-mails into groups based on length and only matching e-mails between groups that are within a limited difference in size (e.g., if your tolerance is 3 character differences, it's pointless to compare any 10-character e-mails to any 20-character e-mails). You should also be able to parallelize the comparisons if you have a multicore machine. Again, these are reductions in the constant, not the order, but I'd guess a good C++ implementation on a fast machine could handle this in a couple of minutes.
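The length-based culling can be sketched in Python as follows. This is only an illustration under assumptions: `near_duplicates` is a hypothetical helper, and plain Levenshtein distance stands in for the comparison function (a tuned bitap or a C++ routine could be swapped in). It relies on the fact that two strings whose lengths differ by more than the tolerance cannot be within that edit distance.

```python
def levenshtein(a, b):
    """Edit distance between `a` and `b` (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def near_duplicates(values, tolerance):
    """Return pairs of distinct values whose edit distance is <= tolerance,
    skipping pairs whose lengths alone already differ by more than that."""
    # Sort by length (then alphabetically, to make the output deterministic).
    ordered = sorted(set(values), key=lambda s: (len(s), s))
    pairs = []
    for i, a in enumerate(ordered):
        for j in range(i + 1, len(ordered)):
            b = ordered[j]
            if len(b) - len(a) > tolerance:
                break  # every later string is at least as long as b
            if levenshtein(a, b) <= tolerance:
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    emails = ["josh@test.com", "jash@test.com",
              "josh@test.test.com", "someone.else@example.org"]
    print(near_duplicates(emails, 2))
```

As the answer says, this only trims the constant: if most addresses have similar lengths, nearly every pair still gets compared.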

3 answers

Answer 1
2 2011-03-11 17:46:05

Answer 2
1 Accepted 2011-03-11 17:43:58

Answer 3
1 2011-03-11 19:23:49
