BigQuery: Fuzzy-join two tables
I need to fuzzy-join two large tables efficiently.
Sample data:
WITH orgs AS (
(SELECT 'Microsoft' AS org)
UNION ALL
(SELECT 'Micro-soft' AS org)
UNION ALL
(SELECT 'Microsoft.com' AS org)
UNION ALL
(SELECT '@microsoft' AS org)
UNION ALL
(SELECT 'Microsoft Vancouver' AS org)
UNION ALL
(SELECT 'Apple' AS org)
UNION ALL
(SELECT 'Netflix' AS org)
),
orgs_ids AS (
(SELECT 'Microsoft' AS org_name, '1' AS id)
UNION ALL
(SELECT 'Apple' AS org_name, '2' AS id)
UNION ALL
(SELECT 'Netflix' AS org_name, '3' AS id)
)
...
Expected result:
+---------------------+-----------+---+
| Microsoft | Microsoft | 1 |
+---------------------+-----------+---+
| Micro-soft | Microsoft | 1 |
+---------------------+-----------+---+
| Microsoft.com | Microsoft | 1 |
+---------------------+-----------+---+
| @microsoft | Microsoft | 1 |
+---------------------+-----------+---+
| Microsoft Vancouver | Microsoft | 1 |
+---------------------+-----------+---+
| Apple | Apple | 2 |
+---------------------+-----------+---+
| Netflix | Netflix | 3 |
+---------------------+-----------+---+
This query does the job:
WITH orgs AS (
(SELECT 'Microsoft' AS org)
UNION ALL
(SELECT 'Micro-soft' AS org)
UNION ALL
(SELECT 'Microsoft.com' AS org)
UNION ALL
(SELECT '@microsoft' AS org)
UNION ALL
(SELECT 'Microsoft Vancouver' AS org)
UNION ALL
(SELECT 'Apple' AS org)
UNION ALL
(SELECT 'Netflix' AS org)
),
orgs_ids AS (
(SELECT 'Microsoft' AS org_name, '1' AS id)
UNION ALL
(SELECT 'Apple' AS org_name, '2' AS id)
UNION ALL
(SELECT 'Netflix' AS org_name, '3' AS id)
),
orgs_match AS (
SELECT
org,
(fhoffa.x.levenshtein(org, org_name)) AS match,
org_name,
id
FROM orgs
JOIN orgs_ids ON TRUE
)
SELECT DISTINCT
org,
FIRST_VALUE(org_name) OVER (PARTITION BY(org) ORDER BY match ASC) as org_name,
FIRST_VALUE(id) OVER (PARTITION BY(org) ORDER BY match ASC) as id
FROM orgs_match
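But as far as I can tell, this is very inefficient, because it compares every row of one table against every row of the other. Is there a better way?
Perhaps there is a way to run a less accurate but cheaper match first to narrow down the number of combinations, and then run the more accurate match?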
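The query below does this: a) splits all names into words, b) converts each word to its Soundex code, c) picks the winner by the number of matching words. You can adjust the logic as needed.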
select org, org_name, id
from (select * from orgs, unnest(split(org, ' ')) word) o
join (select * from orgs_ids, unnest(split(org_name, ' ')) word) i
on soundex(o.word) = soundex(i.word)
group by org, org_name, id
-- desc so that the candidate with the most matching words wins
qualify 1 = row_number() over(partition by org order by count(1) desc)
If applied to the sample data in your question, the output is:
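Separately, the two-stage idea raised in the question (a cheap match first, then an accurate one) could be sketched by combining both queries. This is only an illustration under assumptions, not a tested or tuned solution: the CTE name `candidates` is made up for this example, the sample data is rebuilt with `UNNEST` for brevity, and it relies on the community UDF `fhoffa.x.levenshtein` from the question being available. Stage 1 keeps only pairs that share at least one Soundex word code; stage 2 runs the edit distance on those pairs alone.

```sql
WITH orgs AS (
  SELECT org FROM UNNEST([
    'Microsoft', 'Micro-soft', 'Microsoft.com', '@microsoft',
    'Microsoft Vancouver', 'Apple', 'Netflix'
  ]) AS org
),
orgs_ids AS (
  SELECT 'Microsoft' AS org_name, '1' AS id UNION ALL
  SELECT 'Apple', '2' UNION ALL
  SELECT 'Netflix', '3'
),
-- Stage 1 (cheap): keep only pairs that share at least one Soundex word code.
-- "candidates" is a hypothetical name chosen for this sketch.
candidates AS (
  SELECT DISTINCT o.org, i.org_name, i.id
  FROM (SELECT org, word FROM orgs, UNNEST(SPLIT(org, ' ')) AS word) AS o
  JOIN (SELECT org_name, id, word
        FROM orgs_ids, UNNEST(SPLIT(org_name, ' ')) AS word) AS i
    ON SOUNDEX(o.word) = SOUNDEX(i.word)
)
-- Stage 2 (accurate): rank the surviving pairs by edit distance, keep the closest.
SELECT org, org_name, id
FROM candidates
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY org
  ORDER BY fhoffa.x.levenshtein(org, org_name)
) = 1
```

Because stage 1 joins on an equality condition instead of `ON TRUE`, the expensive distance function should run on far fewer pairs than in the full cross join. The trade-off is that any `org` with no Soundex match in `orgs_ids` drops out of the result entirely, so a `LEFT JOIN` or a fallback pass may be needed if unmatched rows matter.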