Left Join performance in DBT on GCP BigQuery
I'm trying to build a reusable model that will be used in multiple data pipelines built with DBT running on GCP BigQuery. I basically want to get a column value from two disjoint tables.
I created a DBT model that UNIONs those two tables.
data for tbl1 -
col1    col2    col3
ABC1    A1      T1
ABC2    A2      T2
ABC3    A3      T3

data for tbl2 -
col1    col2
A1      T11
A2      T21
A3      T31

data for fact table -
col1    col2
ABC1    XX
ABC2    YY
XYZ     A1
DEF     A3

For the data in the fact table, rows 1 and 2 must match the tbl1 data, and rows 3 and 4 must match the tbl2 data.
with tbl1 as (
    select col1, col2, col3
    from project1.dataset1.table1
),
tbl2 as (
    select col2, col3
    from project1.dataset1.table2
),
tbl1_tbl2 as (
    select
        col3,
        case
            when col1 is not null then col1
            when col2 is not null then col2
        end as single_col
    from (
        select
            col3,
            case when flg = 'tbl1' then col1 end as col1,
            case when flg = 'tbl2' then col2 end as col2
        from (
            select 'tbl2' as flg, null as col1, col3, col2 from tbl2
            union all
            select 'tbl1' as flg, col1, col3, col2 from tbl1
        )
    )
)
This is how the model will be used in a data pipeline: the fact table is joined with the model. The challenge is that I need the data whenever either of the JOIN columns matches a value in the UNION-ed tables, but the OR condition in the LEFT JOIN ... ON fails in BigQuery. By "fails" I mean that the query just keeps running. I'm not sure whether I'm approaching this correctly or whether there's a better way to solve the issue. Please help!
select col3
from project1.dataset1.fact_table fact
left join
tbl1_tbl2 tbl
on
tbl.single_col = fact.col1
or tbl.single_col = fact.col2 --This is where BQ fails
Using an OR in a join predicate causes terrible performance in many different warehouses. You can replace this with two left joins, and COALESCE the values from the two joined tables:
select coalesce(a.col3, b.col3) as col3
from project1.dataset1.fact_table fact
left join tbl1_tbl2 a
on a.single_col = fact.col1
left join tbl1_tbl2 b
on b.single_col = fact.col2
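
To sanity-check the rewrite before running it on real data, the two queries can be compared on the sample rows from the question. The sketch below is only an illustration under several assumptions: it uses Python's built-in sqlite3 module as a stand-in for BigQuery, it drops the project1.dataset1 prefixes, it loads the tbl2 sample values into the col2 and col3 columns that the model selects, and it writes tbl1_tbl2 as a simplified view that should produce the same single_col values as the nested CASE version.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (assumption: the
# join semantics being compared are the same in both engines).
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table table1 (col1 text, col2 text, col3 text);
create table table2 (col2 text, col3 text);
create table fact_table (col1 text, col2 text);

-- Sample rows from the question.
insert into table1 values ('ABC1', 'A1', 'T1'), ('ABC2', 'A2', 'T2'), ('ABC3', 'A3', 'T3');
insert into table2 values ('A1', 'T11'), ('A2', 'T21'), ('A3', 'T31');
insert into fact_table values ('ABC1', 'XX'), ('ABC2', 'YY'), ('XYZ', 'A1'), ('DEF', 'A3');

-- Simplified stand-in for the tbl1_tbl2 CTE: col1 is the key for table1
-- rows and col2 is the key for table2 rows.
create view tbl1_tbl2 as
    select col3, col1 as single_col from table1
    union all
    select col3, col2 as single_col from table2;
""")

# Original approach: a single left join with OR in the predicate.
or_join = """
select tbl.col3
from fact_table fact
left join tbl1_tbl2 tbl
    on tbl.single_col = fact.col1
    or tbl.single_col = fact.col2
"""

# Suggested rewrite: two equality left joins, then coalesce.
two_joins = """
select coalesce(a.col3, b.col3) as col3
from fact_table fact
left join tbl1_tbl2 a
    on a.single_col = fact.col1
left join tbl1_tbl2 b
    on b.single_col = fact.col2
"""

or_rows = sorted(row[0] for row in conn.execute(or_join))
two_join_rows = sorted(row[0] for row in conn.execute(two_joins))

print(or_rows)
print(two_join_rows)
```

On this sample data both queries return the same four col3 values. The two forms are not equivalent in general, though: if a fact row matches through both col1 and col2, the OR join returns one row per distinct match, while the coalesce version keeps only the col1 match, and several matches on each side multiply the row count. It is worth checking whether that can happen in your data.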