
Fuzzy Matching in Different Tables with No Cross Join (Snowflake)

Original post: 2022-02-16 16:05:21 | Tags: sql, snowflake-cloud-data-platform, matching, string-matching

There are two tables, A and B.

They both contain titles referencing the same thing, but the naming conventions are always different and cannot be predicted.

The only way to match titles is to find low difference scores across a number of columns, but for now only the title is important.

There are currently only about 10,000 records in each table. Using the standard Cross Join and EditDistance combination works fine for now, but I've already noticed performance decreasing as the number of records grows.

Is there a more performant way of finding partial matches between strings in different tables?

I apologize if there is an obvious answer. The few posts that deviate from the editdistance solution still assume cross joining: https://community.snowflake.com/s/question/0D50Z00008zPLLxSAO/join-with-partial-string-match

2 Answers

You should use a blocking key strategy to help cut down on the number of pairs generated. This document explains this strategy and other techniques for fuzzy matching on Snowflake: https://drive.google.com/file/d/1FuxZnXojx71t-1kNOaqg1ErrEiiATdsM/view?usp=sharing
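To make the idea concrete, here is a minimal, language-neutral sketch in Python rather than Snowflake SQL. It assumes a hypothetical blocking key (each lowercase word of three or more characters in the title); the linked document may recommend different keys. Only titles that share at least one key are compared with edit distance, so the full cross product is never generated. The function names `blocking_keys`, `candidate_pairs`, and `fuzzy_match` are made up for this illustration.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def blocking_keys(title: str) -> set:
    """Hypothetical blocking keys: lowercase words of 3+ characters."""
    return {tok for tok in title.lower().split() if len(tok) >= 3}


def candidate_pairs(titles_a, titles_b):
    """Return index pairs (i, j) whose titles share at least one key."""
    index = defaultdict(set)
    for j, title in enumerate(titles_b):
        for key in blocking_keys(title):
            index[key].add(j)
    pairs = set()
    for i, title in enumerate(titles_a):
        for key in blocking_keys(title):
            for j in index.get(key, ()):
                pairs.add((i, j))
    return pairs


def fuzzy_match(titles_a, titles_b, max_distance):
    """Run the expensive comparison only on the candidate pairs."""
    matches = []
    for i, j in sorted(candidate_pairs(titles_a, titles_b)):
        d = edit_distance(titles_a[i].lower(), titles_b[j].lower())
        if d <= max_distance:
            matches.append((titles_a[i], titles_b[j], d))
    return matches
```

In SQL terms this corresponds to an equi-join on the key followed by an EditDistance filter, instead of a cross join. The trade-off is that two titles with no key in common are never compared, so the key has to be loose enough for the data at hand.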

As per Ryan's point, the way to avoid comparing all values is to prune which values are joined.

In other domains (spatial), we found that quantizing the GPS coordinates down into buckets and then joining each bucket with its 8 surrounding buckets worked well. It made for more comparisons than strictly necessary for things a human could see were near, but it eliminated all the comparisons for things that are clearly very far away.
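The spatial version can be sketched as follows, again in Python as an illustration rather than the code we actually ran. The cell size, the function names, and the use of plain latitude/longitude floors are all assumptions made for this example. Each point in B is filed under its grid cell, and each point in A is joined only against its own cell and the 8 cells around it.

```python
import math
from collections import defaultdict


def cell(lat: float, lon: float, size: float):
    """Quantize a coordinate down to an integer grid cell."""
    return (math.floor(lat / size), math.floor(lon / size))


def nearby_pairs(points_a, points_b, size):
    """Return (i, j) pairs where B's cell is A's cell or one of its 8 neighbours."""
    grid = defaultdict(list)
    for j, (lat, lon) in enumerate(points_b):
        grid[cell(lat, lon, size)].append(j)

    pairs = []
    for i, (lat, lon) in enumerate(points_a):
        cx, cy = cell(lat, lon, size)
        for dx in (-1, 0, 1):          # 3 x 3 block: own cell plus 8 neighbours
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    pairs.append((i, j))
    return pairs
```

Any two points within one cell width of each other along both axes are guaranteed to land in the same or adjacent cells, so they always become candidates. Points further apart may or may not be paired, which is why an exact distance check still runs on the surviving pairs.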

As with most expensive computations, you want to prune as much as you can without missing things you want to include. That is, false positives among the candidate pairs are fine, because the exact comparison filters them out, but false negatives are very bad.

So how you batch/bucket/prune your data is very specific to the application's data.