BigQuery: Fuzzy-join two tables
I need to fuzzy-join two large tables efficiently.
Sample data:
WITH orgs AS (
(SELECT 'Microsoft' AS org)
UNION ALL
(SELECT 'Micro-soft' AS org)
UNION ALL
(SELECT 'Microsoft.com' AS org)
UNION ALL
(SELECT '@microsoft' AS org)
UNION ALL
(SELECT 'Microsoft Vancouver' AS org)
UNION ALL
(SELECT 'Apple' AS org)
UNION ALL
(SELECT 'Netflix' AS org)
),
orgs_ids AS (
(SELECT 'Microsoft' AS org_name, '1' AS id)
UNION ALL
(SELECT 'Apple' AS org_name, '2' AS id)
UNION ALL
(SELECT 'Netflix' AS org_name, '3' AS id)
)
...
Expected result:
+---------------------+-----------+---+
| Microsoft | Microsoft | 1 |
+---------------------+-----------+---+
| Micro-soft | Microsoft | 1 |
+---------------------+-----------+---+
| Microsoft.com | Microsoft | 1 |
+---------------------+-----------+---+
| @microsoft | Microsoft | 1 |
+---------------------+-----------+---+
| Microsoft Vancouver | Microsoft | 1 |
+---------------------+-----------+---+
| Apple | Apple | 2 |
+---------------------+-----------+---+
| Netflix | Netflix | 3 |
+---------------------+-----------+---+
The following query does the job:
WITH orgs AS (
(SELECT 'Microsoft' AS org)
UNION ALL
(SELECT 'Micro-soft' AS org)
UNION ALL
(SELECT 'Microsoft.com' AS org)
UNION ALL
(SELECT '@microsoft' AS org)
UNION ALL
(SELECT 'Microsoft Vancouver' AS org)
UNION ALL
(SELECT 'Apple' AS org)
UNION ALL
(SELECT 'Netflix' AS org)
),
orgs_ids AS (
(SELECT 'Microsoft' AS org_name, '1' AS id)
UNION ALL
(SELECT 'Apple' AS org_name, '2' AS id)
UNION ALL
(SELECT 'Netflix' AS org_name, '3' AS id)
),
orgs_match AS (
SELECT
org,
(fhoffa.x.levenshtein(org, org_name)) AS match,
org_name,
id
FROM orgs
JOIN orgs_ids ON TRUE
)
SELECT DISTINCT
org,
FIRST_VALUE(org_name) OVER (PARTITION BY(org) ORDER BY match ASC) as org_name,
FIRST_VALUE(id) OVER (PARTITION BY(org) ORDER BY match ASC) as id
FROM orgs_match
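But as far as I understand, this is very inefficient, because the cross join computes the Levenshtein distance for every combination of org and org_name. Is there a better way?
Perhaps there is a way to first run a less accurate but cheaper match to narrow down the number of combinations, and then run a more accurate match on what remains?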
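The query below does this: a) split every name into words; b) compute the soundex code of each word; c) pick the winner by the number of matching words. You can adjust the logic as needed.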
select org, org_name, id
from (select * from orgs, unnest(split(org, ' ')) word) o
join (select * from orgs_ids, unnest(split(org_name, ' ')) word) i
on soundex(o.word) = soundex(i.word)
group by org, org_name, id
qualify 1 = row_number() over(partition by org order by count(1) desc)
If applied to the sample data in your question, the output matches the expected result above.
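The two-stage idea from the question (a cheap match first, then a more accurate one) can be combined with the soundex approach. The following is only a sketch under some assumptions: the `candidates` CTE name is made up for illustration, and it assumes the built-in `EDIT_DISTANCE` function is available in your project; if it is not, the `fhoffa.x.levenshtein` UDF from the question could be swapped in. Soundex acts as the blocking step, so the edit distance is computed only for pairs that share at least one soundex code rather than for the full cross join.

```sql
WITH orgs AS (
  SELECT org
  FROM UNNEST(['Microsoft', 'Micro-soft', 'Microsoft.com', '@microsoft',
               'Microsoft Vancouver', 'Apple', 'Netflix']) AS org
),
orgs_ids AS (
  SELECT 'Microsoft' AS org_name, '1' AS id UNION ALL
  SELECT 'Apple', '2' UNION ALL
  SELECT 'Netflix', '3'
),
-- Stage 1 (cheap): keep only pairs that share at least one soundex code.
-- "candidates" is a hypothetical name used for this sketch.
candidates AS (
  SELECT DISTINCT o.org, i.org_name, i.id
  FROM (
    SELECT org, SOUNDEX(word) AS code
    FROM orgs, UNNEST(SPLIT(org, ' ')) AS word
  ) AS o
  JOIN (
    SELECT org_name, id, SOUNDEX(word) AS code
    FROM orgs_ids, UNNEST(SPLIT(org_name, ' ')) AS word
  ) AS i
  -- Skip empty codes so that words without any letters never match each other.
  ON o.code = i.code AND o.code != ''
)
-- Stage 2 (more accurate): rank the surviving pairs by edit distance.
SELECT org, org_name, id
FROM candidates
WHERE TRUE  -- no-op filter; keeps QUALIFY paired with a WHERE clause
QUALIFY 1 = ROW_NUMBER() OVER (
  PARTITION BY org
  -- Smallest distance wins; org_name breaks ties deterministically.
  ORDER BY EDIT_DISTANCE(LOWER(org), LOWER(org_name)), org_name
)
```

Because stage 1 is an inner join, any org that shares no soundex code with any org_name is dropped from the result. If such rows must be kept, a LEFT JOIN from orgs to the candidates is one option, at the cost of NULL matches to handle downstream.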