
BigQuery: Fuzzy-join two tables

I need to fuzzy-join two large tables efficiently.

Sample data:

WITH orgs AS (
  (SELECT 'Microsoft' AS org)
  UNION ALL 
  (SELECT 'Micro-soft' AS org)
  UNION ALL 
  (SELECT 'Microsoft.com' AS org)
  UNION ALL 
  (SELECT '@microsoft' AS org)
  UNION ALL 
  (SELECT 'Microsoft Vancouver' AS org)
  UNION ALL 
  (SELECT 'Apple' AS org)
  UNION ALL
  (SELECT 'Netflix' AS org)
),
orgs_ids AS (
  (SELECT 'Microsoft' AS org_name, '1' AS id)
  UNION ALL 
  (SELECT 'Apple' AS org_name, '2' AS id)
  UNION ALL 
  (SELECT 'Netflix' AS org_name, '3' AS id)
)

...

Expected result:

+---------------------+-----------+----+
| org                 | org_name  | id |
+---------------------+-----------+----+
| Microsoft           | Microsoft | 1  |
+---------------------+-----------+----+
| Micro-soft          | Microsoft | 1  |
+---------------------+-----------+----+
| Microsoft.com       | Microsoft | 1  |
+---------------------+-----------+----+
| @microsoft          | Microsoft | 1  |
+---------------------+-----------+----+
| Microsoft Vancouver | Microsoft | 1  |
+---------------------+-----------+----+
| Apple               | Apple     | 2  |
+---------------------+-----------+----+
| Netflix             | Netflix   | 3  |
+---------------------+-----------+----+

The following query does the job:

WITH orgs AS (
  (SELECT 'Microsoft' AS org)
  UNION ALL 
  (SELECT 'Micro-soft' AS org)
  UNION ALL 
  (SELECT 'Microsoft.com' AS org)
  UNION ALL 
  (SELECT '@microsoft' AS org)
  UNION ALL 
  (SELECT 'Microsoft Vancouver' AS org)
  UNION ALL 
  (SELECT 'Apple' AS org)
  UNION ALL
  (SELECT 'Netflix' AS org)
),
orgs_ids AS (
  (SELECT 'Microsoft' AS org_name, '1' AS id)
  UNION ALL 
  (SELECT 'Apple' AS org_name, '2' AS id)
  UNION ALL 
  (SELECT 'Netflix' AS org_name, '3' AS id)
),
orgs_match AS (
  SELECT 
    org,
    (fhoffa.x.levenshtein(org, org_name)) AS match,
    org_name,
    id
  FROM orgs
  JOIN orgs_ids ON TRUE
)
SELECT DISTINCT
  org,
  FIRST_VALUE(org_name) OVER (PARTITION BY(org) ORDER BY match ASC) as org_name,
  FIRST_VALUE(id) OVER (PARTITION BY(org) ORDER BY match ASC) as id
FROM orgs_match

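But as far as I can tell, this is very inefficient, because it scores every org against every org_name. Is there a better way?

Perhaps there is a way to first run a less accurate but cheaper match to narrow down the number of combinations, and then run a more accurate match?

The query below does that - a) split all names into words b) apply Soundex to each word c) pick the winner by number of matching words - you can adjust the logic as needed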

select org, org_name, id
from (select * from orgs, unnest(split(org, ' ')) word) o 
join (select * from orgs_ids, unnest(split(org_name, ' ')) word) i 
on soundex(o.word) = soundex(i.word)
group by org, org_name, id
-- keep, per org, the candidate with the most matching words
qualify 1 = row_number() over(partition by org order by count(1) desc)

If applied to the sample data in your question, the output is:

[image: query output]
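
The two-stage idea raised in the question (a cheap filter first, a precise score second) can be combined with the Soundex join above. The following is a minimal sketch, not a tuned solution: it assumes the `fhoffa.x.levenshtein` UDF from the question is accessible, the CTE name `candidates` is made up for illustration, and `WHERE TRUE` is there only as a precaution in case `QUALIFY` is rejected without an accompanying `WHERE`, `GROUP BY`, or `HAVING`.

```sql
WITH orgs AS (
  SELECT org
  FROM UNNEST(['Microsoft', 'Micro-soft', 'Microsoft.com', '@microsoft',
               'Microsoft Vancouver', 'Apple', 'Netflix']) AS org
),
orgs_ids AS (
  SELECT *
  FROM UNNEST([
    STRUCT('Microsoft' AS org_name, '1' AS id),
    STRUCT('Apple' AS org_name, '2' AS id),
    STRUCT('Netflix' AS org_name, '3' AS id)
  ])
),
candidates AS (
  -- Stage 1 (cheap): keep only pairs that share at least one word
  -- with the same Soundex code. This is an equality join, not a cross join.
  SELECT DISTINCT o.org, i.org_name, i.id
  FROM (SELECT org, word FROM orgs, UNNEST(SPLIT(org, ' ')) AS word) AS o
  JOIN (SELECT org_name, id, word FROM orgs_ids, UNNEST(SPLIT(org_name, ' ')) AS word) AS i
    ON SOUNDEX(o.word) = SOUNDEX(i.word)
)
-- Stage 2 (more accurate): score only the surviving pairs by edit distance
-- and keep the closest org_name per org.
SELECT org, org_name, id
FROM candidates
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY org
  ORDER BY fhoffa.x.levenshtein(org, org_name)
) = 1
```

Only pairs that survive stage 1 are scored in stage 2. The trade-off is that an org with no Soundex-matching word produces no row at all, so a LEFT JOIN or a fallback match may be needed for such names.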

Consider the following approach:

select org, 
  array_agg(struct(org_name, id) order by fhoffa.x.levenshtein(org, org_name) limit 1)[offset(0)].* 
from orgs, orgs_ids
group by org

If applied to the sample data in your question, the output is:

[image: query output]
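
If depending on a community UDF is undesirable, the same query can be sketched with BigQuery's built-in `EDIT_DISTANCE` function. This is a variation, not the original answer; the `LOWER` calls are an added assumption that case differences should not count toward the distance.

```sql
WITH orgs AS (
  SELECT org
  FROM UNNEST(['Microsoft', 'Micro-soft', 'Microsoft.com', '@microsoft',
               'Microsoft Vancouver', 'Apple', 'Netflix']) AS org
),
orgs_ids AS (
  SELECT *
  FROM UNNEST([
    STRUCT('Microsoft' AS org_name, '1' AS id),
    STRUCT('Apple' AS org_name, '2' AS id),
    STRUCT('Netflix' AS org_name, '3' AS id)
  ])
)
SELECT
  org,
  -- For each org, sort all candidates by edit distance and keep the closest one.
  ARRAY_AGG(
    STRUCT(org_name, id)
    ORDER BY EDIT_DISTANCE(LOWER(org), LOWER(org_name))
    LIMIT 1
  )[OFFSET(0)].*
FROM orgs, orgs_ids
GROUP BY org
```

This is still a cross join, so it changes only how the distance is computed, not the number of pairs compared.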
