Hive query optimisation

Question

I need to perform an incremental load from an external table into an internal (managed) table in Hive. The source data file is appended with new records daily, and the new records can be identified by the timestamp at which they were loaded (column load_ts in the table). I am trying to achieve this by selecting the records from the source table whose load_ts is greater than the current max(load_ts) in the target table, as shown below:

INSERT INTO TABLE target_temp PARTITION (DATA_DT)
SELECT ms.* FROM temp_db.source_temp ms 
JOIN (select max(load_ts) max_load_ts from target_temp) mt
ON 1=1
WHERE
ms.load_ts > mt.max_load_ts;

However, the above query does not give the desired output. It takes a very long time to execute, which I did not expect with the MapReduce paradigm.

I have also tried other approaches, such as passing max(load_ts) as a variable instead of joining, but performance did not improve. It would be very helpful if anyone could explain what might be wrong with this approach and suggest alternative solutions.

Answer 1

First of all, the map/reduce model does not guarantee that your queries will take less time. The main idea is that performance scales linearly with the number of nodes, but you still have to think about how you're doing things, more so than in normal SQL.

The first thing to check is whether the source table is partitioned by time. If it is not, it should be, because otherwise you read the whole table every single time. Second, you are also calculating the max over the whole destination table every time. You could make it a lot faster by calculating the max on the last partition only, so change this

JOIN (select max(load_ts) max_load_ts from target_temp) mt

to this (the partition column in your INSERT is DATA_DT; PREVIOUS_DATA_DT stands for the value of the last partition already loaded)

JOIN (select max(load_ts) max_load_ts from target_temp WHERE DATA_DT=PREVIOUS_DATA_DT) mt

since we know the max load_ts is going to be in the last partition.
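
Put together, the full statement with the restricted subquery might look like the sketch below. It assumes the partition column is data_dt (as in the question's INSERT), that the value of the last loaded partition is passed in through a hypothetical variable named prev_data_dt, and that dynamic partition inserts are permitted in your session. The script name in the comment is also made up.

```sql
-- Sketch only. prev_data_dt is a hypothetical variable supplied by the caller, e.g.
--   hive --hivevar prev_data_dt=2015-12-22 -f incremental_load.hql

-- Required when the INSERT derives partition values from the data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE target_temp PARTITION (data_dt)
SELECT ms.*                       -- assumes data_dt is the last column of source_temp
FROM temp_db.source_temp ms
JOIN (
  -- Scan only the latest target partition instead of the whole table.
  SELECT max(load_ts) AS max_load_ts
  FROM target_temp
  WHERE data_dt = '${hivevar:prev_data_dt}'
) mt
ON 1=1
WHERE ms.load_ts > mt.max_load_ts;
```

Note that if the chosen partition is empty, max(load_ts) is NULL, the comparison never evaluates to true, and no rows are inserted.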

Otherwise, it's hard to help without knowing the structure of the source table, and, like somebody else commented, the sizes of the two tables.

Answer 2

A JOIN is slower than a variable in the WHERE clause. But the main performance problem here is that your query performs a full scan of both the target table and the source table. I would recommend:

Query only the latest partition for max(load_ts).

Enable statistics gathering and usage:

set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.stats.autogather=true;

Compute statistics on both tables, including column statistics. Statistics make queries such as selecting MAX(partition) or max(ts) execute faster.

Try to put the source partition files directly into the target partition folder instead of using INSERT, if applicable (the partitioning and storage format of the target and source tables must allow this). It works fine, for example, with the textfile storage format when the source partition contains only rows newer than the max(load_ts) of the target partition. You can combine both methods: copy files for source partitions that contain exactly the rows to be inserted, with no filtering needed, and use INSERT for partitions containing mixed data that needs filtering.

Hive may be merging your files during INSERT. This merge phase takes additional time and adds an extra job stage. Check the hive.merge.mapredfiles option and try switching it off.

And, of course, use a pre-calculated variable instead of the join.
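
As a rough sketch of the last two points, the script below takes the maximum timestamp as a pre-calculated variable and switches off the merge step. The variable name max_load_ts, the script name and the sample value are assumptions for illustration; the value would be computed in an earlier step (for example, from the latest target partition) and passed in by the caller. It also assumes load_ts is comparable with a timestamp literal in this format.

```sql
-- Sketch only. Hypothetical invocation, with the value computed in a previous step:
--   hive --hivevar max_load_ts='2015-12-22 23:59:59' -f incremental_load.hql

-- Skip the extra file-merge stage at the end of the job.
SET hive.merge.mapredfiles=false;

-- Required when the INSERT derives partition values from the data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE target_temp PARTITION (data_dt)
SELECT ms.*                       -- assumes data_dt is the last column of source_temp
FROM temp_db.source_temp ms
-- No join: the variable is substituted as text before the query is compiled.
WHERE ms.load_ts > '${hivevar:max_load_ts}';
```

If the source table is partitioned by date, adding a filter on its partition column to the WHERE clause would also avoid the full scan of the source.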

Answer 3

Enable cost-based optimisation (CBO), together with the related properties below:

set hive.cbo.enable=true;
set hive.stats.autogather=true;
set hive.stats.fetch.column.stats=true;
set hive.compute.query.using.stats=true;
set hive.vectorized.execution.enabled=true;
set hive.exec.parallel=true;

Also analyze the tables so that the statistics are available:

ANALYZE TABLE temp_db.source_temp COMPUTE STATISTICS FOR COLUMNS [comma_separated_column_list];
ANALYZE TABLE target_temp PARTITION(DATA_DT) COMPUTE STATISTICS;
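
To check that the statistics are in place and actually being used, something like the following may help. The partition value is a made-up example, and the exact output differs between Hive versions, so treat this as a starting point rather than a definitive check.

```sql
-- Basic statistics (numRows, totalSize, ...) are listed among the partition parameters.
DESCRIBE FORMATTED target_temp PARTITION (data_dt='2015-12-22');

-- Inspect the plan for the max() lookup. When Hive can answer the query from
-- statistics, the plan typically reduces to a fetch with no map/reduce stage.
EXPLAIN SELECT max(load_ts) FROM target_temp;
```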

3 answers

solution1
0 2015-12-23 11:25:25

solution2
0 2015-12-25 10:18:48

solution3
0 2020-03-26 15:40:31
