Joining two date-partitioned Hive Parquet tables on specific dates

Question

I have two Hive Parquet tables, both partitioned by date and each holding several million records. I need to join them on hundreds of specific dates and select around 150 columns. When I run the SQL below, the query runs forever. I'm running it in PySpark. Is there another way to optimize it?

SQL1:

select a.a_col1, a.a_col2, b.b_col1, b.b_col2,..., a.a_col150, b.b_col150
from table_a a join table_b b
on a.col1 = b.col2
and a.date_col = b.date_col
where a.date_col in ('2022-01-01','2022-01-02','2022-01-03','2022-01-04')
and b.date_col in ('2022-01-01','2022-01-02','2022-01-03','2022-01-04')

SQL2:

select a.a_col1, a.a_col2, b.b_col1, b.b_col2,..., a.a_col150, b.b_col150
from table_a a join table_b b
on a.col1 = b.col2
where a.date_col in ('2022-01-01','2022-01-02','2022-01-03','2022-01-04')
and b.date_col in ('2022-01-01','2022-01-02','2022-01-03','2022-01-04')

Answer 1

You can store all the dates in a lookup table and then join it to table_a and table_b.

Step 1 - create a lookup table: create table lkp_date_filter (dt timestamp);
Step 2 - insert the filter dates into it: insert into lkp_date_filter values('2022-01-04')
Step 3 - join it in your main query and remove the IN clauses.

select a.a_col1, a.a_col2, b.b_col1, b.b_col2,..., a.a_col150, b.b_col150
from table_a a
join table_b b on a.col1 = b.col2
join lkp_date_filter lkp on a.date_col = lkp.dt and b.date_col = lkp.dt
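
Since the question mentions hundreds of dates, step 2 is easier to script than to type by hand. The following is a minimal Python sketch, not part of the original answer: the helper names date_range and build_lookup_insert are hypothetical, and it only builds the statement text. Whether a string literal is accepted for a timestamp column depends on the engine and its settings, so an explicit cast may be needed.

```python
from datetime import date, timedelta
from typing import List


def date_range(start: date, end: date) -> List[str]:
    """Return every day from start to end (inclusive) as 'YYYY-MM-DD' strings."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def build_lookup_insert(dates: List[str], table: str = "lkp_date_filter") -> str:
    """Build a single multi-row INSERT statement for the lookup table."""
    # Deduplicate: a date stored twice in the lookup table would
    # duplicate every joined row for that date.
    unique_dates = sorted(set(dates))
    # Reject anything that is not a valid ISO date before it reaches the SQL text.
    for d in unique_dates:
        date.fromisoformat(d)
    values = ", ".join(f"('{d}')" for d in unique_dates)
    return f"insert into {table} values {values}"


stmt = build_lookup_insert(date_range(date(2022, 1, 1), date(2022, 1, 4)))
print(stmt)
```

In a PySpark session the resulting string could then be passed to spark.sql(stmt), assuming lkp_date_filter from step 1 already exists.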

Step 3 avoids the expensive IN clause and should make the SQL faster. Step 2 gives you the flexibility to change the filter values as needed. You can also partition table_a and table_b on date_col to make the SQL faster.
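
Before swapping queries, it may be worth confirming that the lookup-table join returns the same rows as the IN filter. The sketch below does that on a few made-up rows, using Python's built-in sqlite3 module as a stand-in engine. That choice is an assumption for illustration only: the table contents are invented, and the result says nothing about Hive or Spark performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Tiny stand-in tables; the rows are invented for this check.
cur.executescript("""
create table table_a (col1 integer, a_col1 text, date_col text);
create table table_b (col2 integer, b_col1 text, date_col text);
create table lkp_date_filter (dt text);
insert into table_a values (1, 'a1', '2022-01-01'), (2, 'a2', '2022-01-02'), (3, 'a3', '2022-02-01');
insert into table_b values (1, 'b1', '2022-01-01'), (2, 'b2', '2022-01-03'), (3, 'b3', '2022-02-01');
insert into lkp_date_filter values ('2022-01-01'), ('2022-01-02'), ('2022-01-03');
""")

# SQL1 style: join on key and date, filter the dates with IN.
in_rows = cur.execute("""
select a.a_col1, b.b_col1
from table_a a
join table_b b on a.col1 = b.col2 and a.date_col = b.date_col
where a.date_col in ('2022-01-01', '2022-01-02', '2022-01-03')
order by 1, 2
""").fetchall()

# Answer style: the lookup table replaces the IN lists.
lkp_rows = cur.execute("""
select a.a_col1, b.b_col1
from table_a a
join table_b b on a.col1 = b.col2
join lkp_date_filter lkp on a.date_col = lkp.dt and b.date_col = lkp.dt
order by 1, 2
""").fetchall()

print(in_rows == lkp_rows)
```

Because both date columns are matched against the same lkp.dt, the lookup join also forces a.date_col = b.date_col. It therefore corresponds to SQL1 rather than SQL2, which has no such condition and would additionally pair the 'a2' and 'b2' rows in this sample.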

Answer 1
0 2022-08-23 07:59:44
