
Levenshtein distance Python UDF as fuzzy matching proxy in SQL join

I came across a forum post that describes how to create a Python UDF in Redshift: https://community.periscopedata.com/r/y715m2

More info about Python UDFs in Redshift: https://docs.aws.amazon.com/redshift/latest/dg/udf-python-language-support.html

I checked a number of the function's outputs (for example, select public.levenshtein('walk', 'cake')) and it works quite well.
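For readers who want to see what such a function computes, here is a minimal sketch of the standard dynamic-programming Levenshtein distance in plain Python. It is not the code from the linked forum post (that body is not reproduced here), and the Redshift CREATE FUNCTION wrapper is left out; only the name levenshtein and the sample call are taken from the question.

```python
def levenshtein(s, t):
    """Return the edit distance between s and t.

    Counts the minimum number of single-character insertions,
    deletions, and substitutions needed to turn s into t.
    """
    if len(s) < len(t):
        s, t = t, s  # iterate over the longer string, keep rows short
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        current = [i]  # turning s[:i] into an empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein('walk', 'cake'))  # 3
```

The sample pair from the question comes out at 3: 'walk' and 'cake' share only the letter in the second position, so three substitutions are needed.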

I am hoping to use this concept for fuzzy matching in joins between two tables on t1.first_name+last_name = t2.first_name+last_name.

Is anyone familiar with a "magic range" (or can suggest one from experience) that a pair's distance should fall within for the record to be deemed a likely match? That is, what should the minimum and maximum levenshtein(s, t) be for a pair to be considered a likely match?

It depends on your particular case. Think of it as a simple machine learning problem in which you provide a training dataset: run the function against your own data to see the values it returns for different kinds of pairs, and set your range based on that.

If you're matching names, the cost of an error is quite high, both for false negatives (no match for the same person) and false positives (a match between different people), so I would go with Soundex rather than Levenshtein. AFAIK the Levenshtein distance equals one for any two last names that differ in only one letter, but that covers two cases: the last names are actually the same but spelled differently, or the last names are actually different and happen to differ by one letter. Soundex is better at distinguishing such cases.
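The "run it against your data" advice could look roughly like the sketch below. Everything in it is an assumption made for illustration: the labeled pairs are invented, the helper name best_threshold is hypothetical, and the score (false positives plus false negatives, weighted equally) is only one possible choice. Real calibration needs a much larger hand-labeled sample from your own tables.

```python
def levenshtein(s, t):
    """Standard dynamic-programming edit distance (row-by-row)."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (cs != ct)))
        previous = current
    return previous[-1]


# Invented sample: (first_name+last_name in t1, same in t2, same person?)
labeled_pairs = [
    ("jonsmith", "johnsmith", True),         # spelling variant
    ("katherinelee", "catherinelee", True),  # spelling variant
    ("annamiller", "annamiller", True),      # exact match
    ("marklane", "markkane", False),         # different people, one letter apart
    ("saradavis", "tomnguyen", False),       # clearly different
]


def best_threshold(pairs):
    """Return (cutoff, errors) for the lowest-error rule 'distance <= cutoff'.

    errors counts false positives plus false negatives on the sample;
    ties go to the smallest cutoff.
    """
    scored = [(levenshtein(a, b), same) for a, b, same in pairs]
    best = None
    for cutoff in range(max(d for d, _ in scored) + 1):
        errors = sum((d <= cutoff) != same for d, same in scored)
        if best is None or errors < best[1]:
            best = (cutoff, errors)
    return best


print(best_threshold(labeled_pairs))  # (1, 1)
```

On this toy sample the best cutoff is 1, and it still misclassifies one pair: "marklane" and "markkane" are different people at distance 1, the same distance as the genuine spelling variants. That is the ambiguity described above, and it is why no single cutoff separates the two cases.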


 