JOIN on ID, if ID doesn't match then match on other columns (BigQuery)

I have two tables that I am trying to join. The tables have a primary and foreign key, but in some instances the keys don't match and I need to join on the next best match.

I tried using a CASE statement in the join condition. It runs, but because the join isn't perfect it either grabs the incorrect value or duplicates the record.

The way the table works is: if the Info_IDs don't match, we can instead match on Lev1 combined with Cust_Start falling between Info_Start and Info_End.

I need a way to match on the IDs first and, once a row has matched, have SQL stop trying other matches for that row. But I'm not sure if that is something BigQuery can do.

Customer Table

Cust_ID Cust_InfoID Cust_name   Cust_Start  Cust_Lev1
1111    1           Amy         2021-01-01  A
1112    3           John        2020-01-01  D
1113    8           Bill        2020-01-01  D

Info Table

Info_ID Info_Lev1   Info_Start  Info_End    state
1       A           2021-01-15  2021-01-14  NJ
3       D           2020-01-01  2020-12-31  NY
5       A           2021-01-01  2022-01-31  CA

Expected Result

Cust_ID Cust_InfoID Info_ID Cust_Lev1   Cust_Start  Info_Start  Info_End    state
1111    1           1       A           2021-01-01  2021-01-15  2021-01-14  NJ
1112    3           3       D           2020-01-01  2020-01-01  2020-12-31  NY
1113    8           3       D           2020-01-01  2020-01-01  2020-12-31  NY

Join Idea 1:

CASE
  WHEN
    (Cust_InfoID = Info_ID) = true
    AND (Cust_Start BETWEEN Info_Start AND Info_End) = true
  THEN
    Cust_InfoID = Info_ID
  ELSE
    Cust_Start BETWEEN Info_Start AND Info_End
    and Info_Lev1 = Cust_Lev1
END

Output:
Cust_ID Cust_InfoID Info_ID Cust_Lev1   Cust_Start  Info_Start  Info_End    state
1111    1           5       A           2021-01-01  2021-01-01  2022-01-31  CA
1112    3           3       D           2020-01-01  2020-01-01  2020-12-31  NY
1113    8           3       D           2020-01-01  2020-01-01  2020-12-31  NY

The problem here is that the IDs match but the dates don't, so the join falls through to the ELSE branch. This is incorrect.

Join Idea 2:

CASE
  WHEN
    Cust_InfoID = Info_ID
  THEN
    Cust_InfoID = Info_ID
  ELSE
    Cust_Start BETWEEN Info_Start AND Info_End
    and Info_Lev1 = Cust_Lev1
END

Output:
Cust_ID Cust_InfoID Info_ID Cust_Lev1   Cust_Start  Info_Start  Info_End    state
1111    1           1       A           2021-01-01  2021-01-15  2021-01-14  NJ
1111    1           5       A           2021-01-01  2021-01-01  2022-01-31  CA
1112    3           3       D           2020-01-01  2020-01-01  2020-12-31  NY
1113    8           3       D           2020-01-01  2020-01-01  2020-12-31  NY

The problem here is that the IDs match, but the ELSE branch also matches a second, wrong row, which duplicates the record. This is also incorrect.

Example tables here:

with customer as (
    SELECT 1111 Cust_ID,1 Cust_InfoID,'Amy' Cust_name,'2021-01-01' Cust_Start,'A' Cust_Lev1
    UNION ALL 
    SELECT 1112,3,'John','2020-01-01','D'
    union all 
    SELECT 1113,8,'Bill','2020-01-01','D'
),
info as (
    select 1 Info_ID,'A' Info_Lev1,'2021-01-15' Info_Start,'2021-01-14' Info_End,'NJ' state
    union all 
    select 3,'D','2020-01-01','2020-12-31','NY'
    union all 
    select 5,'A','2021-01-01','2022-01-31','CA'
)
select Cust_ID,Cust_InfoID,Info_ID,Cust_Lev1,Cust_Start,Info_Start,Info_End,state
from customer 
join info on 
[case statement here]

The way the table works is if the Info_IDs don't match up we can use a combination of Lev1 and if the cust_start date is between Info_Start and Info_End

Use two left joins, one for each of the conditions:

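select c.*,
       -- prefer the ID match (ii); fall back to the Lev1/date match (il)
       coalesce(ii.info_start, il.info_start) as info_start,
       coalesce(ii.info_end, il.info_end) as info_end,
       coalesce(ii.state, il.state) as state
from customer c left join
     info ii
     on c.cust_infoid = ii.info_id left join
     info il
     on ii.info_id is null and   -- only look for a fallback when no ID match exists
        c.cust_lev1 = il.info_lev1 and
        c.cust_start between il.info_start and il.info_end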
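
To check this against the sample data, the same two-join pattern can be written as one self-contained query. This is only a sketch under a few assumptions: the CTEs are copied from the question, the output aliases and the extra Info_ID column are my own additions, and the dates are left as strings as in the question (ISO-formatted date strings compare correctly with BETWEEN).

```sql
with customer as (
    select 1111 Cust_ID, 1 Cust_InfoID, 'Amy' Cust_name, '2021-01-01' Cust_Start, 'A' Cust_Lev1
    union all
    select 1112, 3, 'John', '2020-01-01', 'D'
    union all
    select 1113, 8, 'Bill', '2020-01-01', 'D'
),
info as (
    select 1 Info_ID, 'A' Info_Lev1, '2021-01-15' Info_Start, '2021-01-14' Info_End, 'NJ' state
    union all
    select 3, 'D', '2020-01-01', '2020-12-31', 'NY'
    union all
    select 5, 'A', '2021-01-01', '2022-01-31', 'CA'
)
select c.Cust_ID,
       c.Cust_InfoID,
       -- ii is the ID match, il is the Lev1/date fallback; at most one of them is non-null
       coalesce(ii.Info_ID, il.Info_ID)       as Info_ID,
       c.Cust_Lev1,
       c.Cust_Start,
       coalesce(ii.Info_Start, il.Info_Start) as Info_Start,
       coalesce(ii.Info_End, il.Info_End)     as Info_End,
       coalesce(ii.state, il.state)           as state
from customer c
left join info ii
       on c.Cust_InfoID = ii.Info_ID
left join info il
       on ii.Info_ID is null                  -- fallback is considered only when the ID join found nothing
      and c.Cust_Lev1 = il.Info_Lev1
      and c.Cust_Start between il.Info_Start and il.Info_End
order by c.Cust_ID
```

On the sample data this should give Info_ID 1, 3 and 3 for customers 1111, 1112 and 1113. Note that if several info rows satisfy the fallback condition for the same customer, that customer will still come back more than once.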

Consider below ("with one JOIN and a CASE statement" as asked):

select any_value(c).*, 
  array_agg(i order by 
    case when c.cust_infoid = i.info_id then 1 else 2 end
    limit 1
  )[offset(0)].*
from `project.dataset.customer` c 
join `project.dataset.info` i
on c.cust_infoid = i.info_id 
or(
  c.cust_lev1 = i.info_lev1 and
  c.cust_start between i.info_start and i.info_end
)
group by format('%t', c)
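
If the array_agg form is hard to read, a variation on the same idea is to rank the candidate matches per customer with row_number() and keep only the best one. This is a sketch of a swapped-in technique, not the query above: it assumes Cust_ID is unique per customer row, reuses the CTEs from the question, and the ranked CTE, the rn alias and the Info_ID tie-break are my own choices.

```sql
with customer as (
    select 1111 Cust_ID, 1 Cust_InfoID, 'Amy' Cust_name, '2021-01-01' Cust_Start, 'A' Cust_Lev1
    union all
    select 1112, 3, 'John', '2020-01-01', 'D'
    union all
    select 1113, 8, 'Bill', '2020-01-01', 'D'
),
info as (
    select 1 Info_ID, 'A' Info_Lev1, '2021-01-15' Info_Start, '2021-01-14' Info_End, 'NJ' state
    union all
    select 3, 'D', '2020-01-01', '2020-12-31', 'NY'
    union all
    select 5, 'A', '2021-01-01', '2022-01-31', 'CA'
),
ranked as (
    select c.Cust_ID, c.Cust_InfoID, i.Info_ID, c.Cust_Lev1, c.Cust_Start,
           i.Info_Start, i.Info_End, i.state,
           -- ID matches sort first; Info_ID breaks ties between fallback matches
           row_number() over (
               partition by c.Cust_ID
               order by case when c.Cust_InfoID = i.Info_ID then 1 else 2 end,
                        i.Info_ID
           ) as rn
    from customer c
    join info i
      on c.Cust_InfoID = i.Info_ID
      or (c.Cust_Lev1 = i.Info_Lev1
          and c.Cust_Start between i.Info_Start and i.Info_End)
)
select * except (rn)   -- keep only the best-ranked match per customer
from ranked
where rn = 1
order by Cust_ID
```

Because this uses an inner join, a customer with no ID match and no fallback match is dropped; switching to a left join would keep such customers with null info columns.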

When applied to the sample data in your question, the output is:

[image: query result]
