Levenshtein distance Python UDF as fuzzy matching proxy in SQL join

Posted 2018-02-09 15:32:24. Tags: python, sql, statistics, amazon-redshift, levenshtein-distance

I came across a forum post that describes a method of creating a Python UDF in Redshift: https://community.periscopedata.com/r/y715m2

More info about Python UDFs in Redshift: https://docs.aws.amazon.com/redshift/latest/dg/udf-python-language-support.html

I checked a number of outputs from the function (for example, select public.levenshtein('walk', 'cake')) and it works quite well.
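
For readers who want to check values like the one above outside Redshift, here is a minimal sketch of a Levenshtein function in plain Python. It is not the code from the linked forum post; it is the textbook dynamic-programming version, and the name `levenshtein` is only chosen to mirror the UDF call.

```python
def levenshtein(s, t):
    """Edit distance counting single-character inserts, deletes and substitutions."""
    # prev[j] holds the distance between the prefix of s seen so far and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,               # delete cs
                cur[j - 1] + 1,            # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


print(levenshtein("walk", "cake"))  # 3
```

The function keeps only two rows of the distance table, so memory grows with the length of one string rather than with the product of both lengths.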

I am hoping to use this concept for fuzzy matching in joins between two tables on t1.first_name+last_name = t2.first_name+last_name.

Is anyone familiar with a "magical range" (or can suggest something from experience) that a record should fall within to be deemed a likely match? That is, what should the minimum and maximum levenshtein(s, t) be for a pair to be considered a likely match?

1 answer

It depends on your particular case. Think of it as a simple machine learning problem in which you provide a training dataset: you can run the function against your data to see the values for different kinds of pairs and set your range based on that. If you're matching names, the cost of error is quite high, both for false negatives (no match for the same person) and false positives (a match between different people), so I would go with Soundex rather than Levenshtein. AFAIK the Levenshtein distance equals one for any two last names that differ in only one letter, but that covers two cases: the last names are actually the same but spelled differently, or the last names are actually different and the difference happens to be one letter. Soundex is better at distinguishing such cases.
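
To make the "run it against your data" idea concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the six labeled pairs are hand-picked toy data, `count_errors` is a hypothetical helper, and `soundex` is a compact version of the standard American Soundex rules rather than any particular database's implementation.

```python
def levenshtein(s, t):
    """Textbook dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


# Soundex digit for each coded consonant; vowels, y, h and w have no digit.
_CODES = {}
for digit, group in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in group:
        _CODES[ch] = str(digit)


def soundex(name):
    """American Soundex: first letter plus up to three digits, zero-padded."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts
    return (letters[0].upper() + "".join(digits) + "000")[:4]


# Toy labeled data (assumption): (name_a, name_b, is_same_person_name)
PAIRS = [
    ("smith", "smyth", True),
    ("johnson", "jonson", True),
    ("meyer", "meier", True),
    ("hansen", "jansen", False),
    ("baker", "barker", False),
    ("lee", "kim", False),
]


def count_errors(pairs, is_match):
    """Return (false_positives, false_negatives) for a matching rule."""
    fp = sum(1 for a, b, same in pairs if is_match(a, b) and not same)
    fn = sum(1 for a, b, same in pairs if not is_match(a, b) and same)
    return fp, fn


for max_distance in (0, 1, 2):
    rule = lambda a, b, d=max_distance: levenshtein(a, b) <= d
    print("levenshtein <=", max_distance, count_errors(PAIRS, rule))

print("soundex equal", count_errors(PAIRS, lambda a, b: soundex(a) == soundex(b)))
```

On real data you would replace `PAIRS` with a sample of hand-labeled pairs from your own tables and pick the rule whose false positive and false negative counts you can live with. Soundex has blind spots of its own (for instance, it always keeps the first letter), so it is worth scoring it the same way rather than assuming it wins.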