Join two date-partitioned Parquet Hive tables on specific dates
I have two Hive Parquet tables partitioned by date, each with several million records. I need to join them on hundreds of specific dates and select around 150 columns. When I tried the SQL below, the query ran forever. I'm running this SQL in PySpark. Is there any way to optimize it?
SQL1:
select a.a_col1, a.a_col2, b.b_col1, b.b_col2,..., a.a_col150, b.b_col150
from table_a a join table_b b
on a.col1 = b.col2
and a.date_col = b.date_col
where a.date_col in ('2022-01-01','2022-01-02','2022-01-03','2022-01-04')
and b.date_col in ('2022-01-01','2022-01-02','2022-01-03','2022-01-04')
SQL2:
select a.a_col1, a.a_col2, b.b_col1, b.b_col2,..., a.a_col150, b.b_col150
from table_a a join table_b b
on a.col1 = b.col2
where a.date_col in ('2022-01-01','2022-01-02','2022-01-03','2022-01-04')
and b.date_col in ('2022-01-01','2022-01-02','2022-01-03','2022-01-04')
You can store all the dates in a lookup (lkp) table and then join it to table_a and table_b.
Step 1 - first create a lookup table:
create table lkp_date_filter (dt timestamp);
Step 2 - insert the filter dates into it:
insert into lkp_date_filter values('2022-01-04');
Step 3 - join it in your main query and remove the IN clauses.
select a.a_col1, a.a_col2, b.b_col1, b.b_col2,..., a.a_col150, b.b_col150
from table_a a
join table_b b on a.col1 = b.col2
join lkp_date_filter lkp on a.date_col = lkp.dt and b.date_col = lkp.dt
Step 3 avoids the long IN clauses, which can make the query faster. Step 2 gives you the flexibility to change the filter values as needed. Partitioning table_a and table_b on date_col also helps, because the engine can then skip partitions that do not match.
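As a rough check that the lookup-table join in step 3 returns the same rows as SQL1, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Hive/Spark. The table and column names follow the post; the sample rows and the use of a text column for dt (instead of timestamp) are assumptions made for illustration only. It says nothing about performance on real partitioned tables.

```python
import sqlite3

# SQLite is only a stand-in for Hive/Spark here; the sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table_a (col1 integer, a_col1 text, date_col text);
    create table table_b (col2 integer, b_col1 text, date_col text);
    create table lkp_date_filter (dt text);

    insert into table_a values
        (1, 'a1', '2022-01-01'),
        (2, 'a2', '2022-01-02'),
        (3, 'a3', '2022-02-15');
    insert into table_b values
        (1, 'b1', '2022-01-01'),
        (2, 'b2', '2022-01-02'),
        (3, 'b3', '2022-02-15');
    insert into lkp_date_filter values ('2022-01-01'), ('2022-01-02');
""")

# SQL1 from the question, reduced to two columns and two dates.
in_query = """
    select a.a_col1, b.b_col1
    from table_a a
    join table_b b on a.col1 = b.col2 and a.date_col = b.date_col
    where a.date_col in ('2022-01-01', '2022-01-02')
      and b.date_col in ('2022-01-01', '2022-01-02')
    order by a.a_col1
"""

# Step 3 from the answer: the lookup table replaces both IN lists.
lkp_query = """
    select a.a_col1, b.b_col1
    from table_a a
    join table_b b on a.col1 = b.col2
    join lkp_date_filter lkp on a.date_col = lkp.dt and b.date_col = lkp.dt
    order by a.a_col1
"""

in_rows = conn.execute(in_query).fetchall()
lkp_rows = conn.execute(lkp_query).fetchall()
print(in_rows)
print(lkp_rows)
```

Note that the lookup table should hold each date only once; a duplicated date would duplicate every matching row in the join result.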