BigQuery: Fuzzy-join two tables
I need to efficiently fuzzy-join two huge tables.
Sample data:
WITH orgs AS (
  (SELECT 'Microsoft' AS org)
  UNION ALL
  (SELECT 'Micro-soft' AS org)
  UNION ALL
  (SELECT 'Microsoft.com' AS org)
  UNION ALL
  (SELECT '@microsoft' AS org)
  UNION ALL
  (SELECT 'Microsoft Vancouver' AS org)
  UNION ALL
  (SELECT 'Apple' AS org)
  UNION ALL
  (SELECT 'Netflix' AS org)
),
orgs_ids AS (
  (SELECT 'Microsoft' AS org_name, '1' AS id)
  UNION ALL
  (SELECT 'Apple' AS org_name, '2' AS id)
  UNION ALL
  (SELECT 'Netflix' AS org_name, '3' AS id)
)
...
Expected result:
+---------------------+-----------+---+
| Microsoft           | Microsoft | 1 |
+---------------------+-----------+---+
| Micro-soft          | Microsoft | 1 |
+---------------------+-----------+---+
| Microsoft.com       | Microsoft | 1 |
+---------------------+-----------+---+
| @microsoft          | Microsoft | 1 |
+---------------------+-----------+---+
| Microsoft Vancouver | Microsoft | 1 |
+---------------------+-----------+---+
| Apple               | Apple     | 2 |
+---------------------+-----------+---+
| Netflix             | Netflix   | 3 |
+---------------------+-----------+---+
This query does the job:
WITH orgs AS (
  (SELECT 'Microsoft' AS org)
  UNION ALL
  (SELECT 'Micro-soft' AS org)
  UNION ALL
  (SELECT 'Microsoft.com' AS org)
  UNION ALL
  (SELECT '@microsoft' AS org)
  UNION ALL
  (SELECT 'Microsoft Vancouver' AS org)
  UNION ALL
  (SELECT 'Apple' AS org)
  UNION ALL
  (SELECT 'Netflix' AS org)
),
orgs_ids AS (
  (SELECT 'Microsoft' AS org_name, '1' AS id)
  UNION ALL
  (SELECT 'Apple' AS org_name, '2' AS id)
  UNION ALL
  (SELECT 'Netflix' AS org_name, '3' AS id)
),
orgs_match AS (
  SELECT
    org,
    (fhoffa.x.levenshtein(org, org_name)) AS match,
    org_name,
    id
  FROM orgs
  JOIN orgs_ids ON TRUE
)
SELECT DISTINCT
  org,
  FIRST_VALUE(org_name) OVER (PARTITION BY(org) ORDER BY match ASC) AS org_name,
  FIRST_VALUE(id) OVER (PARTITION BY(org) ORDER BY match ASC) AS id
FROM orgs_match
But as far as I understand, this is extremely inefficient, because JOIN ... ON TRUE is a cross join and the Levenshtein distance gets computed for every pair of rows. Is there a better way?
Maybe there is a way to run a less accurate but much more efficient match first to narrow down the number of combinations, and then run the more accurate matching?
The query below does roughly this: a) splits each name into words, b) computes the Soundex code of each word, and c) picks the winner based on the count of matched words. You can adjust the logic as you wish.
select org, org_name, id
from (select * from orgs, unnest(split(org, ' ')) word) o
join (select * from orgs_ids, unnest(split(org_name, ' ')) word) i
  on soundex(o.word) = soundex(i.word)
group by org, org_name, id
-- desc: the candidate with the most matched words wins
qualify 1 = row_number() over(partition by org order by count(1) desc)
If applied to the sample data in your question, the output is:
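As a variation on the Soundex query above, the two-stage idea from the question (a cheap match first, an accurate one second) could be sketched roughly as follows. This is only an illustration, not a tuned solution: it assumes the built-in EDIT_DISTANCE function is available in your project (the fhoffa.x.levenshtein UDF from the question could be substituted), the CTE name candidates is made up for the example, and the LOWER() normalization is an added assumption rather than part of either query above.

```sql
-- Sketch only: stage 1 blocks candidate pairs with a cheap equi-join on
-- per-word Soundex codes; stage 2 ranks the surviving pairs by edit distance.
WITH orgs AS (
  SELECT org FROM UNNEST([
    'Microsoft', 'Micro-soft', 'Microsoft.com', '@microsoft',
    'Microsoft Vancouver', 'Apple', 'Netflix'
  ]) AS org
),
orgs_ids AS (
  SELECT * FROM UNNEST([
    STRUCT('Microsoft' AS org_name, '1' AS id),
    ('Apple', '2'),
    ('Netflix', '3')
  ])
),
candidates AS (  -- hypothetical name for the stage-1 result
  SELECT DISTINCT o.org, i.org_name, i.id
  FROM (SELECT org, word FROM orgs, UNNEST(SPLIT(org, ' ')) AS word) AS o
  JOIN (SELECT org_name, id, word
        FROM orgs_ids, UNNEST(SPLIT(org_name, ' ')) AS word) AS i
    ON SOUNDEX(o.word) = SOUNDEX(i.word)
)
-- Stage 2: the expensive distance is computed only for candidate pairs.
SELECT org, org_name, id
FROM candidates
WHERE TRUE  -- harmless; kept in case QUALIFY needs an accompanying WHERE/GROUP BY/HAVING
QUALIFY 1 = ROW_NUMBER() OVER (
  PARTITION BY org
  -- smallest distance wins; org_name breaks ties deterministically
  ORDER BY EDIT_DISTANCE(LOWER(org), LOWER(org_name)), org_name
)
```

Because stage 1 is an inner join, any org whose words share no Soundex code with any org_name drops out of the result entirely. If unmatched rows need to be kept, a LEFT JOIN from orgs back to this result would be one way to do it.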