JOIN on ID; if the ID doesn't match, then match on other columns (BigQuery)
I have two tables that I am trying to join. The tables have a primary and foreign key, but there are some instances where the keys don't match and I need to join on the next best match.
I tried to use a CASE statement and it works, but because the join isn't perfect it will either grab the incorrect value or duplicate the record.
The way the tables work is that if the Info_IDs don't match up, we can use a combination of Lev1 and whether the Cust_Start date is between Info_Start and Info_End.
I need a way to match on the IDs and then have SQL stop matching on that row, but I'm not sure if that is something BigQuery can do.
Customer Table
Cust_ID Cust_InfoID Cust_name Cust_Start Cust_Lev1
1111 1 Amy 2021-01-01 A
1112 3 John 2020-01-01 D
1113 8 Bill 2020-01-01 D
Info Table
Info_ID Info_Lev1 Info_Start Info_End state
1 A 2021-01-15 2021-01-14 NJ
3 D 2020-01-01 2020-12-31 NY
5 A 2021-01-01 2022-01-31 CA
Expected Result
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 1 A 2021-01-01 2021-01-15 2021-01-14 NJ
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1113 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
Join Idea 1:
CASE
WHEN
(Cust_InfoID = Info_ID) = true
AND (Cust_Start BETWEEN Info_Start AND Info_End) = true
THEN
Cust_InfoID = Info_ID
ELSE
Cust_Start BETWEEN Info_Start AND Info_End
and Info_Lev1 = Cust_Lev1
END
Output:
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 5 A 2021-01-01 2021-01-01 2022-01-31 CA
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1113 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
The problem here is that the IDs match but the dates don't, so it falls through to the ELSE branch to join. This is incorrect.
Join Idea 2:
CASE
WHEN
Cust_InfoID = Info_ID
THEN
Cust_InfoID = Info_ID
ELSE
Cust_Start BETWEEN Info_Start AND Info_End
and Info_Lev1 = Cust_Lev1
END
Output:
Cust_ID Cust_InfoID Info_ID Cust_Lev1 Cust_Start Info_Start Info_End state
1111 1 1 A 2021-01-01 2021-01-15 2021-01-14 NJ
1111 1 5 A 2021-01-01 2021-01-01 2022-01-31 CA
1112 3 3 D 2020-01-01 2020-01-01 2020-12-31 NY
1113 8 3 D 2020-01-01 2020-01-01 2020-12-31 NY
The problem here is that the IDs match, but the ELSE branch also matches a second, wrong row, which duplicates the record. This is also incorrect.
Example tables here:
with customer as (
SELECT 1111 Cust_ID,1 Cust_InfoID,'Amy' Cust_name,'2021-01-01' Cust_Start,'A' Cust_Lev1
UNION ALL
SELECT 1112,3,'John','2020-01-01','D'
union all
SELECT 1113,8,'Bill','2020-01-01','D'
),
info as (
select 1 Info_ID,'A' Info_Lev1,'2021-01-15' Info_Start,'2021-01-14' Info_End,'NJ' state
union all
select 3,'D','2020-01-01','2020-12-31','NY'
union all
select 5,'A','2021-01-01','2022-01-31','CA'
)
select Cust_ID,Cust_InfoID,Info_ID,Cust_Lev1,Cust_Start,Info_Start,Info_End,state
from customer
join info on
[case statement here]
The way the tables work is that if the Info_IDs don't match up, we can use a combination of Lev1 and whether the Cust_Start date is between Info_Start and Info_End.
Use two left joins, one for each of the conditions:
select c.*,
       coalesce(ii.info_id, il.info_id) as info_id,
       coalesce(ii.info_start, il.info_start) as info_start,
       coalesce(ii.info_end, il.info_end) as info_end,
       coalesce(ii.state, il.state) as state
from customer c left join
     info ii
     on c.cust_infoid = ii.info_id left join
     info il
     on ii.info_id is null and  -- fall back only when the ID join found nothing
        c.cust_lev1 = il.info_lev1 and
        c.cust_start between il.info_start and il.info_end
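To try the two-left-join approach without creating tables, it can be combined with the sample CTEs from the question. This is only a sketch: the dates are left as strings exactly as in the question (so BETWEEN compares them lexicographically, which works for ISO-formatted dates), and the output aliases and the final ORDER BY are my additions.

```sql
-- Sample data copied from the question; dates are kept as ISO strings.
with customer as (
  select 1111 as cust_id, 1 as cust_infoid, 'Amy' as cust_name,
         '2021-01-01' as cust_start, 'A' as cust_lev1
  union all select 1112, 3, 'John', '2020-01-01', 'D'
  union all select 1113, 8, 'Bill', '2020-01-01', 'D'
),
info as (
  select 1 as info_id, 'A' as info_lev1,
         '2021-01-15' as info_start, '2021-01-14' as info_end, 'NJ' as state
  union all select 3, 'D', '2020-01-01', '2020-12-31', 'NY'
  union all select 5, 'A', '2021-01-01', '2022-01-31', 'CA'
)
select
  c.cust_id,
  c.cust_infoid,
  coalesce(ii.info_id, il.info_id)       as info_id,
  c.cust_lev1,
  c.cust_start,
  coalesce(ii.info_start, il.info_start) as info_start,
  coalesce(ii.info_end, il.info_end)     as info_end,
  coalesce(ii.state, il.state)           as state
from customer c
-- First attempt: match on the key.
left join info ii
  on c.cust_infoid = ii.info_id
-- Second attempt: only considered when the key match found nothing.
left join info il
  on ii.info_id is null
  and c.cust_lev1 = il.info_lev1
  and c.cust_start between il.info_start and il.info_end
order by c.cust_id
```

On the sample data this should return one row per customer: 1111 stays with Info_ID 1 because the key matched, and 1113 falls back to Info_ID 3. If several info rows satisfy the fallback condition for the same customer, this query will still return duplicates for that customer.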
Consider below ("with one JOIN and a CASE statement" as asked)
select any_value(c).*,
array_agg(i order by
case when c.cust_infoid = i.info_id then 1 else 2 end
limit 1
)[offset(0)].*
from `project.dataset.customer` c
join `project.dataset.info` i
on c.cust_infoid = i.info_id
or(
c.cust_lev1 = i.info_lev1 and
c.cust_start between i.info_start and i.info_end
)
group by format('%t', c)
When applied to the sample data in your question, the output is [result table not preserved]
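The same "prefer the ID match, otherwise take the fallback" ranking can also be sketched with ROW_NUMBER and QUALIFY instead of ARRAY_AGG. This is a variant, not the query from the answer above. It assumes Cust_ID uniquely identifies a customer row, and it uses the sample CTEs from the question in place of the `project.dataset` tables.

```sql
-- Sample data copied from the question; dates are kept as ISO strings.
with customer as (
  select 1111 as cust_id, 1 as cust_infoid, 'Amy' as cust_name,
         '2021-01-01' as cust_start, 'A' as cust_lev1
  union all select 1112, 3, 'John', '2020-01-01', 'D'
  union all select 1113, 8, 'Bill', '2020-01-01', 'D'
),
info as (
  select 1 as info_id, 'A' as info_lev1,
         '2021-01-15' as info_start, '2021-01-14' as info_end, 'NJ' as state
  union all select 3, 'D', '2020-01-01', '2020-12-31', 'NY'
  union all select 5, 'A', '2021-01-01', '2022-01-31', 'CA'
)
select c.*, i.*
from customer c
join info i
  on c.cust_infoid = i.info_id
  or (
    c.cust_lev1 = i.info_lev1 and
    c.cust_start between i.info_start and i.info_end
  )
where true  -- harmless filter, kept so QUALIFY is accompanied by a WHERE clause
-- Keep one info row per customer, ranking an ID match ahead of a fallback match.
qualify row_number() over (
  partition by c.cust_id  -- assumption: cust_id is unique per customer
  order by case when c.cust_infoid = i.info_id then 1 else 2 end
) = 1
order by c.cust_id
```

As with the ARRAY_AGG version, this is an inner join, so a customer with neither an ID match nor a fallback match drops out of the result. When two fallback rows tie, which one is kept is arbitrary unless more columns are added to the ORDER BY.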