BigQuery: fuzzy-joining two tables

Question

I need to efficiently fuzzy-join two huge tables.

Sample data:

WITH orgs AS (
  (SELECT 'Microsoft' AS org)
  UNION ALL 
  (SELECT 'Micro-soft' AS org)
  UNION ALL 
  (SELECT 'Microsoft.com' AS org)
  UNION ALL 
  (SELECT '@microsoft' AS org)
  UNION ALL 
  (SELECT 'Microsoft Vancouver' AS org)
  UNION ALL 
  (SELECT 'Apple' AS org)
  UNION ALL
  (SELECT 'Netflix' AS org)
),
orgs_ids AS (
  (SELECT 'Microsoft' AS org_name, '1' AS id)
  UNION ALL 
  (SELECT 'Apple' AS org_name, '2' AS id)
  UNION ALL 
  (SELECT 'Netflix' AS org_name, '3' AS id)
)

...

Expected result:

+---------------------+-----------+---+
| Microsoft           | Microsoft | 1 |
+---------------------+-----------+---+
| Micro-soft          | Microsoft | 1 |
+---------------------+-----------+---+
| Microsoft.com       | Microsoft | 1 |
+---------------------+-----------+---+
| @microsoft          | Microsoft | 1 |
+---------------------+-----------+---+
| Microsoft Vancouver | Microsoft | 1 |
+---------------------+-----------+---+
| Apple               | Apple     | 2 |
+---------------------+-----------+---+
| Netflix             | Netflix   | 3 |
+---------------------+-----------+---+

This query does the job:

WITH orgs AS (
  (SELECT 'Microsoft' AS org)
  UNION ALL 
  (SELECT 'Micro-soft' AS org)
  UNION ALL 
  (SELECT 'Microsoft.com' AS org)
  UNION ALL 
  (SELECT '@microsoft' AS org)
  UNION ALL 
  (SELECT 'Microsoft Vancouver' AS org)
  UNION ALL 
  (SELECT 'Apple' AS org)
  UNION ALL
  (SELECT 'Netflix' AS org)
),
orgs_ids AS (
  (SELECT 'Microsoft' AS org_name, '1' AS id)
  UNION ALL 
  (SELECT 'Apple' AS org_name, '2' AS id)
  UNION ALL 
  (SELECT 'Netflix' AS org_name, '3' AS id)
),
orgs_match AS (
  SELECT 
    org,
    (fhoffa.x.levenshtein(org, org_name)) AS match,
    org_name,
    id
  FROM orgs
  JOIN orgs_ids ON TRUE
)
SELECT DISTINCT
  org,
  FIRST_VALUE(org_name) OVER (PARTITION BY(org) ORDER BY match ASC) as org_name,
  FIRST_VALUE(id) OVER (PARTITION BY(org) ORDER BY match ASC) as id
FROM orgs_match

But as far as I understand, this is extremely inefficient, because JOIN ... ON TRUE compares every org with every org_name. Is there a better way?

Answer 1

Maybe you could first run a less accurate but much cheaper match to narrow down the number of combinations, and then run the more accurate matching on what is left?

The query below does roughly this: a) splits all names into words, b) applies SOUNDEX to each word, c) picks the winner for each org by the number of matching words. You can adjust the logic as you wish.

select org, org_name, id
from (select * from orgs, unnest(split(org, ' ')) word) o 
join (select * from orgs_ids, unnest(split(org_name, ' ')) word) i 
on soundex(o.word) = soundex(i.word)
group by org, org_name, id
qualify 1 = row_number() over(partition by org order by count(1) desc)

If applied to the sample data in your question, the output is: [output image not included]
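
The two-stage idea described in this answer (cheap filter first, accurate distance second) could be sketched roughly as follows. This is only an illustration under assumptions: the CTE names `candidates` and `scored` and the column `distance` are made up here, the sample data is the question's data rewritten with UNNEST for brevity, and `fhoffa.x.levenshtein` is the same shared UDF the question uses. BigQuery's built-in `EDIT_DISTANCE` function could probably be swapped in for the UDF.

```sql
WITH orgs AS (
  SELECT org
  FROM UNNEST(['Microsoft', 'Micro-soft', 'Microsoft.com', '@microsoft',
               'Microsoft Vancouver', 'Apple', 'Netflix']) AS org
),
orgs_ids AS (
  SELECT *
  FROM UNNEST([
    STRUCT('Microsoft' AS org_name, '1' AS id),
    ('Apple', '2'),
    ('Netflix', '3')
  ])
),
candidates AS (
  -- Stage 1 (cheap): keep only pairs that share at least one SOUNDEX word code.
  SELECT DISTINCT o.org, i.org_name, i.id
  FROM (SELECT org, word FROM orgs, UNNEST(SPLIT(org, ' ')) AS word) AS o
  JOIN (SELECT org_name, id, word
        FROM orgs_ids, UNNEST(SPLIT(org_name, ' ')) AS word) AS i
    ON SOUNDEX(o.word) = SOUNDEX(i.word)
),
scored AS (
  -- Stage 2 (expensive): compute the edit distance only for surviving pairs.
  SELECT org, org_name, id,
         fhoffa.x.levenshtein(org, org_name) AS distance
  FROM candidates
)
SELECT org, org_name, id
FROM scored
WHERE TRUE
-- Keep the closest candidate per org.
QUALIFY ROW_NUMBER() OVER (PARTITION BY org ORDER BY distance) = 1
```

Note that the inner join in stage 1 drops any org that shares no SOUNDEX code with any org_name. If such rows must still appear in the result, a LEFT JOIN from orgs to the final result would be needed.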

Answer 2

Consider the approach below:

select org, 
  array_agg(struct(org_name, id) order by fhoffa.x.levenshtein(org, org_name) limit 1)[offset(0)].* 
from orgs, orgs_ids
group by org

If applied to the sample data in your question, the output is: [output image not included]
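
A possible variation on this answer, shown only as a sketch: carry the distance along with the best match, so that weak matches can be inspected or filtered afterwards. The field name `distance` and the alias `best` are assumptions introduced for illustration. The query still compares every org with every org_name, exactly like the original.

```sql
WITH orgs AS (
  SELECT org
  FROM UNNEST(['Microsoft', 'Micro-soft', 'Microsoft.com', '@microsoft',
               'Microsoft Vancouver', 'Apple', 'Netflix']) AS org
),
orgs_ids AS (
  SELECT *
  FROM UNNEST([
    STRUCT('Microsoft' AS org_name, '1' AS id),
    ('Apple', '2'),
    ('Netflix', '3')
  ])
)
SELECT
  org,
  best.org_name,
  best.id,
  best.distance  -- edit distance of the winning match
FROM (
  SELECT
    org,
    -- Keep only the closest org_name per org, together with its distance.
    ARRAY_AGG(
      STRUCT(org_name, id, fhoffa.x.levenshtein(org, org_name) AS distance)
      ORDER BY fhoffa.x.levenshtein(org, org_name)
      LIMIT 1
    )[OFFSET(0)] AS best
  FROM orgs, orgs_ids
  GROUP BY org
)
```

A filter such as `WHERE best.distance <= 5` could then discard orgs whose nearest name is still far away. The threshold 5 is purely hypothetical and would need tuning on real data.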

2 answers

Answer 1
1 2021-10-17 18:00:04

Answer 2
0 2021-10-16 19:41:19
