
Fuzzy matching a string in SQL

I have a User table that has id, first_name, last_name, street_address, city, state, zip-code, firm, user_identifier, created_at, update_at.

This table has a lot of duplication: the same users have been entered multiple times as new users. For example:


id  first_name  last_name  street_address  user_identifier
--  ----------  ---------  --------------  ---------------
11  Mary        Doe        123 Main Ave    M2111111
21  Mary        Doe        123 Main Ave    M2344455
13  Mary Esq    Doe        123 Main Ave    M1233444

I would like to know if there is a way of doing fuzzy matching on this table.

Basically, I would like to find all the users that have the same name and the same address, allowing for slight differences: maybe the address is the same but has a different apartment number, or one record has a middle name and the other duplicates don't.

I was thinking of creating a new column that concatenates first_name, last_name, and street_address, and doing a fuzzy match on that column.

I tried Levenshtein distance on first_name and last_name concatenated as full_name, but it doesn't seem to catch names that include a middle name:

select * from users
where levenshtein('Mary Doe', full_name) <=1;

I am using Databricks and PostgreSQL.

Thank you!

In Postgres you can use the fuzzystrmatch extension. It provides a levenshtein function that returns the edit distance between two texts; you can then perform fuzzy matching with a predicate like the following:

where levenshtein(street_address, '123 Main Avex') <= 1

This will match all of the records above, because the distance between '123 Main Ave' and '123 Main Avex' is 1 (one insertion).

Of course, the value 1 here is just an example and performs quite strict matching (a difference of only one character). You should either use a larger number or, as @IVO GELOV suggests, use a relative distance (the distance divided by the length).
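A relative-distance predicate might be sketched like this; the 25% threshold and the way full_name is built from first_name and last_name are just illustrative assumptions, and fuzzystrmatch must be installed:

```sql
-- Sketch: match on relative edit distance instead of an absolute one.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

SELECT *
FROM users
WHERE levenshtein(lower(first_name || ' ' || last_name), lower('Mary Doe'))
      -- allow edits up to 25% of the longer string's length (threshold is arbitrary)
      <= 0.25 * greatest(length(first_name || ' ' || last_name), length('Mary Doe'));
```

Lower-casing both sides first keeps capitalization differences from counting as edits.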

If you get to the point where Levenshtein ("edit distance") isn't capturing all of the matches you need, I'd strongly encourage you to check out pg_trgm. It. Is. Awesome.

postgresql.org/docs/current/pgtrgm.html
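As a sketch of how pg_trgm could be applied to the deduplication question above: the % operator matches pairs whose similarity exceeds pg_trgm.similarity_threshold (default 0.3). The self-join shape and column choices here are assumptions about the table:

```sql
-- Sketch: pair up likely-duplicate users by trigram similarity.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT a.id, b.id,
       similarity(a.first_name || ' ' || a.last_name,
                  b.first_name || ' ' || b.last_name) AS name_similarity
FROM users AS a
JOIN users AS b
  ON a.id < b.id        -- skip self-pairs and avoid reporting each pair twice
 AND (a.first_name || ' ' || a.last_name) % (b.first_name || ' ' || b.last_name)
 AND a.street_address % b.street_address
ORDER BY name_similarity DESC;
```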

As an example of why to use trigrams: they let you pick up cases where first_name and last_name are reversed, a relatively common error. Levenshtein isn't well suited to spotting that, as all it does is transform one string into another and count the number of moves required. When you've got elements swapped, that increases the distance quite a bit and makes a match less likely. As an example, pretend that you have a record where the right full name is "David Adams". It's pretty common to find the last name entered as "Adam", and to find first and last names reversed. So that's three plausible forms for a simple name.

How does Levenshtein perform compared with the Postgres trigram implementation? For this, I compared levenshtein(string 1, string 2) with similarity(string 1, string 2). As noted above, Levenshtein is a count where a higher score means less similar. To normalize the scores to a 0-1 value where 1 = identical, I divided the distance by the max full name length, as suggested above, and subtracted the result from 1. That last bit is to make the figures directly comparable to a similarity() score. (Otherwise, you've got two scales where 1 means opposite things.)

Here are some simple results, rounded a bit for clarity:

Row 1        Row 2        Levenshtein()  Levenshtein %  Similarity %
David Adams  Adam David              10              9            77
Adam David   Adams David              1             91            77
Adams David  David Adams             10              9           100
As you can see, the similarity() score performs better in a lot of cases, even in this simple example. Then again, Levenshtein feels better in one case. It's not rare to combine techniques. If you do that, normalize the scales to save yourself some headache.

But all of this is made a lot easier if you've got cleaner data to start with. If one of your problems is inconsistent abbreviations and punctuation, Levenshtein can be a poor match. For this reason, it's helpful to perform address standardization before duplicate matching.

For what it's worth (a lot), trigrams in Postgres can use indexes. It can be a good bet to find a technique that safely reduces the candidates with an indexed search before performing a more expensive comparison like Levenshtein. Ah, and a trick for Levenshtein: if you have a target tolerance, and have the length of your strings stored, you can exclude strings that are too short or too long relative to that stored length without running the more expensive fuzzy comparison. If, for example, you have a starting string of length 10 and only want strings that are at most 2 transformations away, you're wasting your time testing strings that are only 7 characters long, etc.
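Those two ideas might be sketched as follows; the index name, the concatenation expression, and the tolerance of 2 are all assumptions for illustration:

```sql
-- Sketch: a trigram index so candidate narrowing can use an indexed search...
CREATE INDEX IF NOT EXISTS users_name_trgm_idx
    ON users USING gin ((first_name || ' ' || last_name) gin_trgm_ops);

-- ...and a cheap length pre-filter before the expensive levenshtein() call.
SELECT *
FROM users
WHERE abs(length(first_name || ' ' || last_name) - length('Mary Doe')) <= 2
  AND levenshtein(first_name || ' ' || last_name, 'Mary Doe') <= 2;
```

Because AND conditions can be evaluated cheapest-first, the length check lets the planner discard hopeless rows before computing any edit distances.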

Note that the bad data input problem you describe often comes down to

  • poor user training and/or
  • poor UX

It's worth reviewing how bad data is getting in, once you've got your cleanup in good order. If you have a finite set of trainable users, it can help to run a nightly (etc.) scan to detect new likely duplicates, and then go and talk to whoever is generating them. Maybe there's something they don't know that you can tell them; maybe there's a problem in the UI that you don't know about that they can tell you.

There is a LIKE operator. Have you considered trying that?

The following SQL statement selects all customers with a CustomerName starting with "a":

SELECT * FROM Customers
WHERE CustomerName LIKE 'a%';

https://www.w3schools.com/sql/sql_like.asp
