I have two tables that I am trying to join. The tables have a primary and foreign key but there are some instances where the keys don't match and I need to join on the next best match.
I tried to use a case statement and it works but because the join isn't perfect. It will either grab the incorrect value or duplicate the record.
The way the table works is if the Info_ID
s don't match up we can use a combination of Lev1
and if the cust_start
date is between Info_Start
and Info_End
I need a way to match on the IDs and then the SQL stops matching on that row. But im not sure if that is something BigQuery can do.
Customer Table
Cust_ID Cust_InfoID Cust_name Cust_Start Cust_Lev1
1111 1 Amy 2021-01-01 A
1112 3 John 2020-01-01 D
1113 8 Bill 2020-01-01 D
Info Table
Info_ID Info_Lev1 Info_Start Info_End state
1 A 2021-01-15 2021-01-14 NJ
3 D 2020-01-01 2020-12-31 NY
5 A 2021-01-01 2022-01-31 CA
Expected Result
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 1 A 2021-01-01 2021-01-15 2021-01-14 NJ
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1112 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
Join Idea 1:
CASE
WHEN
(Cust_InfoID = Info_ID) = true
AND (Cust_Start BETWEEN Info_Start AND Info_End) = true
THEN
Cust_InfoID = Info_ID
ELSE
Cust_Start BETWEEN Info_Start AND Info_End
and Info_Lev1 = Cust_Lev1
END
Output:
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 5 A 2021-01-01 2021-01-01 2022-01-31 CA
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1113 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
The problem here is that IDs match but the dates don't so it uses the ELSE statement to join. This is incorrect
Join Idea 2:
CASE
WHEN
Cust_InfoID = Info_ID
THEN
Cust_InfoID = Info_ID
ELSE
Cust_Start BETWEEN Info_Start AND Info_End
and Info_Lev1 = Cust_Lev1
END
Output:
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 1 A 2021-01-01 2021-01-15 2021-01-14 NJ
1111 1 5 A 2021-01-01 2021-01-01 2022-01-31 CA
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1113 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
The problem here is that IDs match but the ELSE statement also matches up the wrong duplicate row. This is also incorrect
Example tables here:
with customer as (
SELECT 1111 Cust_ID,1 Cust_InfoID,'Amy' Cust_name,'2021-01-01' Cust_Start,'A' Cust_Lev1
UNION ALL
SELECT 1112,3,'John','2020-01-01','D'
union all
SELECT 1113,8,'Bill','2020-01-01','D'
),
info as (
select 1 Info_ID,'A' Info_Lev1,'2021-01-15' Info_Start,'2021-01-14' Info_End,'NJ' state
union all
select 3,'D','2020-01-01','2020-12-31','NY'
union all
select 5,'A','2021-01-01','2022-01-31','CA'
)
select Cust_ID,Cust_InfoID,Info_ID,Cust_Lev1,Cust_Start,Info_Start,Info_End,state
from customer
join info on
[case statement here]
The way the table works is if the Info_IDs don't match up we can use a combination of Lev1 and if the cust_start date is between Info_Start and Info_End
Use two left join
s, one for each of the conditions:
select c.*,
coalesce(ii.info_start, il.info_start),
coalesce(ii.info_end, il.info_end),
coalesce(ii.state, il.state)
from customer c left join
info ii
on c.cust_infoid = ii.info_id left join
info il
on ii.info_id is null and
c.cust_lev1 = il.info_lev1 and
c.cust_start between il.info_start and il.info_end
Consider below ("with one JOIN and a CASE statement" as asked)
select any_value(c).*,
array_agg(i order by
case when c.cust_infoid = i.info_id then 1 else 2 end
limit 1
)[offset(0)].*
from `project.dataset.customer` c
join `project.dataset.info` i
on c.cust_infoid = i.info_id
or(
c.cust_lev1 = i.info_lev1 and
c.cust_start between i.info_start and i.info_end
)
group by format('%t', c)
when applied to sample data in your question - output is
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.