
How to do a fuzzy match in Snowflake / SQL

Here is the business logic:

"The ABC Company INC" and "The north America, ABC" (these two should show as a match)

"The 16K LLC" and "16K LLC" (these two should also show as a match)

I attached some test data. Thanks so much, guys!

Any matching attempt that treats string pairs like "The ABC Company INC" and "The north America, ABC", or "Preferred ABC Group" and "The Preferred Residences", as a match is probably going to give you many false positives, since in some of your examples the two strings share only one word.

That said, Snowflake does provide a couple of functions that might help: EDITDISTANCE and JAROWINKLER_SIMILARITY.

EDITDISTANCE returns the Levenshtein distance between two strings: the number of single-character insertions, deletions, or substitutions needed to change one string into the other. A lower number means fewer edits are needed, so the strings are potentially a closer match.
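As a rough illustration, the query below applies EDITDISTANCE to the two pairs from the question. The inline VALUES rows and the column names name_a and name_b are stand-ins for your own table, not something from the original post. As far as I know, EDITDISTANCE treats upper and lower case as different characters, so this sketch uppercases both sides first; whether that normalization suits your data is an assumption.

```sql
-- Sample pairs taken from the question; swap the VALUES list for your own table.
SELECT
    name_a,
    name_b,
    -- Levenshtein distance after uppercasing, so case differences are not counted as edits
    EDITDISTANCE(UPPER(name_a), UPPER(name_b)) AS edit_distance
FROM (VALUES
    ('The ABC Company INC', 'The north America, ABC'),
    ('The 16K LLC',         '16K LLC')
) AS pairs (name_a, name_b);
```

For the second pair the distance is 4, because deleting the four characters of "The " (including the space) is enough. The first pair needs far more edits, which shows why edit distance alone is unlikely to treat it as a match.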

JAROWINKLER_SIMILARITY uses the Jaro-Winkler algorithm to calculate a "similarity" score between 0 and 100 for two strings. A higher number indicates more similarity, with 100 being an exact match.
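A minimal sketch of scoring pairs with JAROWINKLER_SIMILARITY, using strings quoted in this thread. The VALUES rows and the names name_a, name_b, and jw_score are illustrative assumptions. The third row is the pair called out above as a likely false positive, included so you can compare its score with the pairs you do want matched.

```sql
-- Sample pairs from this thread; swap the VALUES list for your own table.
SELECT
    name_a,
    name_b,
    -- Similarity score from 0 (nothing in common) to 100 (exact match)
    JAROWINKLER_SIMILARITY(name_a, name_b) AS jw_score
FROM (VALUES
    ('The ABC Company INC', 'The north America, ABC'),
    ('The 16K LLC',         '16K LLC'),
    ('Preferred ABC Group', 'The Preferred Residences')
) AS pairs (name_a, name_b)
ORDER BY jw_score DESC;
```

Looking at the scores for known good and known bad pairs side by side is probably the quickest way to see whether this function separates them well enough for your data.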

You could use either or both of these functions to generate a score for each pair of strings, and then choose a threshold that best separates matches from non-matches for your purposes.
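One way this could look in practice is sketched below. The table company_pairs and its columns name_a and name_b are hypothetical, and the cutoffs (85 and 4) are placeholders to be tuned against pairs you have already labeled by hand, not recommended values.

```sql
-- company_pairs(name_a, name_b) is a hypothetical table of candidate pairs.
WITH scored AS (
    SELECT
        name_a,
        name_b,
        EDITDISTANCE(UPPER(name_a), UPPER(name_b)) AS edit_distance,
        JAROWINKLER_SIMILARITY(name_a, name_b)     AS jw_score
    FROM company_pairs
)
SELECT
    name_a,
    name_b,
    edit_distance,
    jw_score
FROM scored
-- Placeholder thresholds: keep a pair if either score suggests a close match
WHERE jw_score >= 85
   OR edit_distance <= 4
ORDER BY jw_score DESC, edit_distance ASC;
```

Combining the two conditions with OR favors recall; switching to AND would make the filter stricter and reduce false positives at the cost of missing looser matches.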
