JOIN on ID; if the ID doesn't match, match on other columns (BigQuery)

I have two tables that I am trying to join. The tables have a primary key and a foreign key, but in some cases the keys don't match and I need to join on the next-best match.

I tried using a CASE statement, and it works, but the join isn't perfect: it either grabs the incorrect value or duplicates the record.

The way the tables work is that if the Info_IDs don't match, we can fall back to matching on Lev1 together with Cust_Start falling between Info_Start and Info_End.

I need a way to match on the IDs and, once a row has matched, have the SQL stop trying to match that row on anything else. But I'm not sure whether that is something BigQuery can do.

Customer Table

Cust_ID Cust_InfoID Cust_name   Cust_Start  Cust_Lev1
1111    1           Amy         2021-01-01  A
1112    3           John        2020-01-01  D
1113    8           Bill        2020-01-01  D

Info Table

Info_ID Info_Lev1   Info_Start  Info_End    state
1       A           2021-01-15  2021-01-14  NJ
3       D           2020-01-01  2020-12-31  NY
5       A           2021-01-01  2022-01-31  CA

Expected Result

Cust_ID Cust_InfoID Info_ID Cust_Lev1   Cust_Start  Info_Start  Info_End    state
1111    1           1       A           2021-01-01  2021-01-15  2021-01-14  NJ
1112    3           3       D           2020-01-01  2020-01-01  2020-12-31  NY
1113    8           3       D           2020-01-01  2020-01-01  2020-12-31  NY

Join Idea 1:

CASE
  WHEN
    (Cust_InfoID = Info_ID) = true
    AND (Cust_Start BETWEEN Info_Start AND Info_End) = true
  THEN
    Cust_InfoID = Info_ID
  ELSE
    Cust_Start BETWEEN Info_Start AND Info_End
    and Info_Lev1 = Cust_Lev1
END

Output:
Cust_ID Cust_InfoID Info_ID Cust_Lev1   Cust_Start  Info_Start  Info_End    state
1111    1           5       A           2021-01-01  2021-01-01  2022-01-31  CA
1112    3           3       D           2020-01-01  2020-01-01  2020-12-31  NY
1113    8           3       D           2020-01-01  2020-01-01  2020-12-31  NY

The problem here is that for customer 1111 the IDs match but the dates don't, so the ELSE branch is used for the join and the row is matched to Info_ID 5. This is incorrect.

Join Idea 2:

CASE
  WHEN
    Cust_InfoID = Info_ID
  THEN
    Cust_InfoID = Info_ID
  ELSE
    Cust_Start BETWEEN Info_Start AND Info_End
    and Info_Lev1 = Cust_Lev1
END

Output:
Cust_ID Cust_InfoID Info_ID Cust_Lev1   Cust_Start  Info_Start  Info_End    state
1111    1           1       A           2021-01-01  2021-01-15  2021-01-14  NJ
1111    1           5       A           2021-01-01  2021-01-01  2022-01-31  CA
1112    3           3       D           2020-01-01  2020-01-01  2020-12-31  NY
1113    8           3       D           2020-01-01  2020-01-01  2020-12-31  NY

The problem here is that the IDs match, but the ELSE branch also matches customer 1111 to a second, wrong row (Info_ID 5), producing a duplicate. This is also incorrect.

Example tables here:

with customer as (
    SELECT 1111 Cust_ID,1 Cust_InfoID,'Amy' Cust_name,'2021-01-01' Cust_Start,'A' Cust_Lev1
    UNION ALL 
    SELECT 1112,3,'John','2020-01-01','D'
    union all 
    SELECT 1113,8,'Bill','2020-01-01','D'
),
info as (
    select 1 Info_ID,'A' Info_Lev1,'2021-01-15' Info_Start,'2021-01-14' Info_End,'NJ' state
    union all 
    select 3,'D','2020-01-01','2020-12-31','NY'
    union all 
    select 5,'A','2021-01-01','2022-01-31','CA'
)
select Cust_ID,Cust_InfoID,Info_ID,Cust_Lev1,Cust_Start,Info_Start,Info_End,state
from customer 
join info on 
[case statement here]
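
For reference, the duplicate row from Join Idea 2 can be reproduced outside BigQuery. The sketch below is an assumption-laden stand-in: it loads the sample rows into an in-memory SQLite database through Python's sqlite3 module (not BigQuery) and plugs the CASE condition into the join. It suggests why the join cannot "stop" after an ID match: the condition is evaluated separately for every customer/info pair, so customer 1111 passes once through the WHEN branch (Info_ID 1) and once through the ELSE branch (Info_ID 5).

```python
import sqlite3

# In-memory SQLite database used as a stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (Cust_ID INTEGER, Cust_InfoID INTEGER, Cust_name TEXT,
                       Cust_Start TEXT, Cust_Lev1 TEXT);
CREATE TABLE info (Info_ID INTEGER, Info_Lev1 TEXT, Info_Start TEXT,
                   Info_End TEXT, state TEXT);
INSERT INTO customer VALUES
  (1111, 1, 'Amy',  '2021-01-01', 'A'),
  (1112, 3, 'John', '2020-01-01', 'D'),
  (1113, 8, 'Bill', '2020-01-01', 'D');
INSERT INTO info VALUES
  (1, 'A', '2021-01-15', '2021-01-14', 'NJ'),
  (3, 'D', '2020-01-01', '2020-12-31', 'NY'),
  (5, 'A', '2021-01-01', '2022-01-31', 'CA');
""")

# Join Idea 2: the CASE is evaluated independently for each row pair.
idea2_rows = conn.execute("""
SELECT Cust_ID, Cust_InfoID, Info_ID, state
FROM customer
JOIN info ON
  CASE
    WHEN Cust_InfoID = Info_ID THEN Cust_InfoID = Info_ID
    ELSE (Cust_Start BETWEEN Info_Start AND Info_End)
         AND Info_Lev1 = Cust_Lev1
  END
ORDER BY Cust_ID, Info_ID
""").fetchall()

for row in idea2_rows:
    print(row)
```

Customer 1111 appears twice in the result, matching the Join Idea 2 output above. The dates are stored as ISO-formatted strings, as in the example tables, so BETWEEN compares them lexicographically.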

The way the tables work is that if the Info_IDs don't match, we can fall back to matching on Lev1 together with Cust_Start falling between Info_Start and Info_End.

Use two left joins, one for each of the conditions:

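select c.*,
       coalesce(ii.info_id, il.info_id) as info_id,
       coalesce(ii.info_start, il.info_start) as info_start,
       coalesce(ii.info_end, il.info_end) as info_end,
       coalesce(ii.state, il.state) as state
from customer c
     -- first choice: match on the ID
     left join info ii
       on c.cust_infoid = ii.info_id
     -- fallback: only tried when the ID match found nothing
     left join info il
       on ii.info_id is null and
          c.cust_lev1 = il.info_lev1 and
          c.cust_start between il.info_start and il.info_end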
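
To check the two-left-join idea against the sample data, here is a minimal sketch that runs an equivalent query in an in-memory SQLite database via Python's sqlite3 module. SQLite is only a stand-in for BigQuery here (an assumption), and the selected columns are trimmed to the ones needed to see which info row each customer gets.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for BigQuery (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER, cust_infoid INTEGER, cust_name TEXT,
                       cust_start TEXT, cust_lev1 TEXT);
CREATE TABLE info (info_id INTEGER, info_lev1 TEXT, info_start TEXT,
                   info_end TEXT, state TEXT);
INSERT INTO customer VALUES
  (1111, 1, 'Amy',  '2021-01-01', 'A'),
  (1112, 3, 'John', '2020-01-01', 'D'),
  (1113, 8, 'Bill', '2020-01-01', 'D');
INSERT INTO info VALUES
  (1, 'A', '2021-01-15', '2021-01-14', 'NJ'),
  (3, 'D', '2020-01-01', '2020-12-31', 'NY'),
  (5, 'A', '2021-01-01', '2022-01-31', 'CA');
""")

# ii is the ID match; il is the fallback, joined only when ii found nothing.
rows = conn.execute("""
SELECT c.cust_id,
       c.cust_infoid,
       COALESCE(ii.info_id, il.info_id) AS info_id,
       COALESCE(ii.state, il.state)     AS state
FROM customer c
LEFT JOIN info ii
  ON c.cust_infoid = ii.info_id
LEFT JOIN info il
  ON ii.info_id IS NULL
 AND c.cust_lev1 = il.info_lev1
 AND c.cust_start BETWEEN il.info_start AND il.info_end
ORDER BY c.cust_id
""").fetchall()

for row in rows:
    print(row)
```

On the sample rows this gives one row per customer: 1111 keeps its ID match (Info_ID 1), and 1113, whose Cust_InfoID 8 has no counterpart, falls back to Info_ID 3. If several info rows satisfied the fallback condition for the same customer, this query would still return one row per fallback match.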

Consider the approach below ("with one JOIN and a CASE statement", as asked):

select any_value(c).*,
  -- keep one info row per customer, preferring an ID match (1) over a fallback match (2)
  array_agg(i order by
    case when c.cust_infoid = i.info_id then 1 else 2 end
    limit 1
  )[offset(0)].*
from `project.dataset.customer` c
join `project.dataset.info` i
on c.cust_infoid = i.info_id
or (
  c.cust_lev1 = i.info_lev1 and
  c.cust_start between i.info_start and i.info_end
)
-- one group per customer row
group by format('%t', c)

When applied to the sample data in your question, the output is:

(image of the query output; not reproduced here)
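
The ranking step in that query (join on either condition, then keep the best-ranked match per customer) can be sketched in plain Python. This is only an illustration of the logic, not BigQuery code; the `Customer` and `Info` tuples and the `best_match` helper are hypothetical names introduced here.

```python
from typing import List, NamedTuple, Optional


class Customer(NamedTuple):
    cust_id: int
    cust_infoid: int
    cust_start: str  # ISO date string, as in the sample data
    cust_lev1: str


class Info(NamedTuple):
    info_id: int
    info_lev1: str
    info_start: str
    info_end: str
    state: str


def best_match(c: Customer, infos: List[Info]) -> Optional[Info]:
    """Return the preferred info row for a customer, or None if nothing matches."""
    # Same predicate as the join: ID match OR (level match AND date in range).
    candidates = [
        i for i in infos
        if c.cust_infoid == i.info_id
        or (c.cust_lev1 == i.info_lev1
            and i.info_start <= c.cust_start <= i.info_end)
    ]
    if not candidates:
        return None
    # Same ranking as the CASE: ID matches sort as 1, fallback matches as 2.
    return min(candidates, key=lambda i: 1 if c.cust_infoid == i.info_id else 2)


customers = [
    Customer(1111, 1, "2021-01-01", "A"),
    Customer(1112, 3, "2020-01-01", "D"),
    Customer(1113, 8, "2020-01-01", "D"),
]
infos = [
    Info(1, "A", "2021-01-15", "2021-01-14", "NJ"),
    Info(3, "D", "2020-01-01", "2020-12-31", "NY"),
    Info(5, "A", "2021-01-01", "2022-01-31", "CA"),
]

matches = {}
for c in customers:
    m = best_match(c, infos)
    matches[c.cust_id] = m.info_id if m else None
print(matches)
```

Two differences from the SQL are worth noting. The inner join drops customers with no match at all, whereas this helper returns None for them. And when several fallback rows tie, `min` keeps the first one in list order, while the query's choice among tied rows is not determined by its ORDER BY.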
