Left Join performance in DBT on GCP BigQuery

Question

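I am trying to build a reusable model that will be used in multiple data pipelines built on DBT running on GCP BigQuery. Essentially, I want to fetch one column value from two disjoint tables.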

I created a DBT model that UNIONs these two tables.

Data in tbl1:

col1    col2    col3
ABC1    A1      T1
ABC2    A2      T2
ABC3    A3      T3

Data in tbl2:

col1    col2
A1      T11
A2      T21
A3      T31

Data in the fact table:

col1    col2
ABC1    XX
ABC2    YY
XYZ     A1
DEF     A3

For the data in the fact table, rows 1 and 2 must match the tbl1 data, and rows 3 and 4 must match the tbl2 data.

with tbl1 as (
    select col1, col2, col3
    from project1.dataset1.table1
),

tbl2 as (
    select col2, col3
    from project1.dataset1.table2
),

tbl1_tbl2 as (
    -- collapse the two key columns into a single lookup key
    select
        col3,
        case
            when col1 is not null then col1
            when col2 is not null then col2
        end as single_col
    from (
        -- keep only the key column that belongs to each source table
        select
            col3,
            case when flg = 'tbl1' then col1 end as col1,
            case when flg = 'tbl2' then col2 end as col2
        from (
            select 'tbl2' as flg, null as col1, col3, col2 from tbl2
            union all
            select 'tbl1' as flg, col1, col3, col2 from tbl1
        )
    )
)
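
As a side note, the nested CASE expressions above appear to reduce to a plain UNION ALL, because each source table only ever contributes one key column (col1 from table1, col2 from table2). The following is a minimal sketch of an equivalent lookup model, assuming the same table and column names as in the question and that the two key columns have compatible types:

```sql
-- Sketch of a flatter version of the tbl1_tbl2 lookup model.
-- Assumes table1.col1 and table2.col2 share a compatible type.
select
    col3,
    col1 as single_col   -- key contributed by table1
from project1.dataset1.table1

union all

select
    col3,
    col2 as single_col   -- key contributed by table2
from project1.dataset1.table2
```

This does not change the join problem itself, but it makes the model easier to read and avoids carrying the helper flg column around.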

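This is how the model will be used in the data pipelines: the fact table is joined with the model. The challenge is that I need the data whenever either of the JOIN columns matches a value in the UNION-ed table. But an OR condition in LEFT JOIN ... ON fails in BigQuery. By "fails" I mean that the query just keeps running and never finishes. I am not sure whether I am approaching this correctly or whether there is a better way to solve it. Please help!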


select col3
from project1.dataset1.fact_table fact
left join tbl1_tbl2 tbl
    on tbl.single_col = fact.col1
    or tbl.single_col = fact.col2 -- This is where BQ fails

Answer 1

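Using OR in a join predicate leads to poor performance in many different warehouses. You can replace it with two left joins and coalesce the values from the two joined tables: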

select coalesce(a.col3, b.col3) as col3
from project1.dataset1.fact_table fact
left join tbl1_tbl2 a
    on a.single_col = fact.col1
left join tbl1_tbl2 b
    on b.single_col = fact.col2
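
To check the behaviour on the sample rows from the question, here is a self-contained sketch in which inline SELECTs stand in for the real tables. It assumes that the two tbl2 columns shown in the question correspond to col2 and col3 in the query, and the CTE names are only placeholders for the actual model and fact table:

```sql
-- Inline rows replace the real tables so the query runs on its own.
with tbl1_tbl2 as (
    select 'ABC1' as single_col, 'T1' as col3 union all
    select 'ABC2', 'T2'  union all
    select 'ABC3', 'T3'  union all
    select 'A1',   'T11' union all
    select 'A2',   'T21' union all
    select 'A3',   'T31'
),

fact as (
    select 'ABC1' as col1, 'XX' as col2 union all
    select 'ABC2', 'YY' union all
    select 'XYZ',  'A1' union all
    select 'DEF',  'A3'
)

select
    fact.col1,
    fact.col2,
    coalesce(a.col3, b.col3) as col3   -- prefer the col1 match, then col2
from fact
left join tbl1_tbl2 as a
    on a.single_col = fact.col1
left join tbl1_tbl2 as b
    on b.single_col = fact.col2
order by fact.col1
-- Result:
--   ABC1, XX, T1
--   ABC2, YY, T2
--   DEF,  A3, T31
--   XYZ,  A1, T11
```

One difference from the OR version is worth keeping in mind: if a fact row matches on both col1 and col2, the OR join returns two rows, while this version returns one row and prefers the col1 match. If single_col is not unique in the lookup model, either form will still multiply fact rows.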

Solution 1
0 2023-01-17 17:29:41
