
Left Join performance in DBT on GCP BigQuery

I'm trying to build a reusable model that will be used in multiple data pipelines built with DBT running on GCP BigQuery. I basically want to get a column value from two disjoint tables.

I created a DBT model that UNIONs those two tables.

data for tbl1 -

col1    col2    col3
ABC1    A1      T1
ABC2    A2      T2
ABC3    A3      T3

data for tbl2 -

col1    col2
A1      T11
A2      T21
A3      T31

data for fact table -

col1    col2
ABC1    XX
ABC2    YY
XYZ     A1
DEF     A3

For the data in the fact table, rows 1 and 2 must match the tbl1 data, and rows 3 and 4 must match the tbl2 data.

with tbl1 as (
    select col1, col2, col3
    from project1.dataset1.table1
)

, tbl2 as (
    select col2, col3
    from project1.dataset1.table2
)

-- one lookup key (single_col) per row: col1 for tbl1 rows, col2 for tbl2 rows
, tbl1_tbl2 as (
    select
        col3,
        case
            when col1 is not null then col1
            when col2 is not null then col2
        end as single_col
    from (
        select
            col3,
            case when flg = 'tbl1' then col1 end as col1,
            case when flg = 'tbl2' then col2 end as col2
        from (
            select 'tbl2' as flg, null as col1, col3, col2 from tbl2
            union all
            select 'tbl1' as flg, col1, col3, col2 from tbl1
        )
    )
)
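If tbl1_tbl2 only needs to expose one lookup key per row, the nested CASE expressions and the flg column may not be needed. The sketch below is one possible simplification, assuming the same source tables and column names as the model above; it should produce the same (col3, single_col) rows, because tbl1 rows always take col1 as the key and tbl2 rows always take col2.

```sql
with tbl1 as (
    select col1, col2, col3
    from project1.dataset1.table1
)

, tbl2 as (
    select col2, col3
    from project1.dataset1.table2
)

-- Each branch picks its own key column directly, so no flag is required.
, tbl1_tbl2 as (
    select col3, col1 as single_col from tbl1
    union all
    select col3, col2 as single_col from tbl2
)

select col3, single_col
from tbl1_tbl2
```

This does not by itself address the slow join; it only makes the model easier to read before it is joined to the fact table.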

This is how the model will be used in a data pipeline: the fact table is joined with the model. The challenge is that I need the data when the value of either JOIN column matches the UNION-ed tables. But an OR condition in the LEFT JOIN ... ON clause fails in BigQuery. By "fails" I mean that the query keeps running and never seems to finish. I'm not sure if I'm approaching this correctly or if there's a better way to solve it. Please help!


select col3
from project1.dataset1.fact_table fact
left join tbl1_tbl2 tbl
    on tbl.single_col = fact.col1
    or tbl.single_col = fact.col2 -- This is where BQ fails

Using an OR in a join predicate causes terrible performance in many different warehouses, typically because the join can no longer be executed as a plain equality join on a single key. You can replace it with two left joins and coalesce the values from the two joined tables:

select coalesce(a.col3, b.col3) as col3
from project1.dataset1.fact_table fact
left join tbl1_tbl2 a
    on a.single_col = fact.col1
left join tbl1_tbl2 b
    on b.single_col = fact.col2
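To check the two-join pattern against the sample rows from the question, the following self-contained sketch inlines the data as CTEs instead of reading from project1.dataset1. The inlined values are an assumption about what tbl1_tbl2 would contain: the tbl1 keys ABC1 to ABC3 and the tbl2 keys A1 to A3, each paired with its T value.

```sql
-- Sample rows from the question, inlined so the query runs without any tables.
with tbl1_tbl2 as (
    select 'ABC1' as single_col, 'T1' as col3 union all
    select 'ABC2', 'T2' union all
    select 'ABC3', 'T3' union all
    select 'A1', 'T11' union all
    select 'A2', 'T21' union all
    select 'A3', 'T31'
)

, fact as (
    select 'ABC1' as col1, 'XX' as col2 union all
    select 'ABC2', 'YY' union all
    select 'XYZ', 'A1' union all
    select 'DEF', 'A3'
)

select
    fact.col1,
    fact.col2,
    coalesce(a.col3, b.col3) as col3  -- prefer the match on col1, else col2
from fact
left join tbl1_tbl2 a
    on a.single_col = fact.col1
left join tbl1_tbl2 b
    on b.single_col = fact.col2
order by fact.col1
-- Expected rows: (ABC1, XX, T1), (ABC2, YY, T2), (DEF, A3, T31), (XYZ, A1, T11)
```

The two forms are not strictly equivalent. If a fact row matched on both col1 and col2, the OR join would return one row per match, while this version returns a single row and keeps the col1 match. Duplicate single_col values would still multiply rows in either form.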

