
Hive query optimisation

I have to perform an incremental load from an external table into an internal table in Hive, where the source data file is appended with new records on a daily basis. The new records can be filtered out based on the timestamp (column load_ts in the table) at which they were loaded. I am trying to achieve this by selecting the records from the source table whose load_ts is greater than the current max(load_ts) in the target table, as given below:

INSERT INTO TABLE target_temp PARTITION (DATA_DT)
SELECT ms.*
FROM temp_db.source_temp ms
JOIN (select max(load_ts) max_load_ts from target_temp) mt
  ON 1 = 1
WHERE ms.load_ts > mt.max_load_ts;

But the above query does not give the desired output, and it takes a very long time to execute (which should not be the case with the Map-Reduce paradigm).

I have tried other scenarios as well, such as passing max(load_ts) as a variable instead of joining, but still no improvement in performance. It would be very helpful if anyone could give their insights as to what is possibly incorrect in this approach, along with any alternate solutions.

First of all, the map/reduce model does not guarantee that your queries will take less time. The main idea is that its performance will scale linearly with the number of nodes, but you still have to think about how you're doing things, more so than in normal SQL.

The first thing to check is whether the source table is partitioned by time. If not, it should be, as otherwise you'd be reading the whole table every single time. Second, you're also calculating the max over the whole destination table every time. You could make it a lot faster if you calculated the max only on the last partition, so change this

JOIN (select max(load_ts) max_load_ts from target_temp) mt

to this (you didn't write the partition column, so I am going to assume it's called 'dt'):

JOIN (select max(load_ts) max_load_ts from target_temp WHERE dt=PREVIOUS_DATA_DT) mt

since we know the max load_ts is going to be in the last partition.
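Putting the pieces together, the full incremental load might then look something like this. This is only a sketch: the partition column 'dt' and the PREVIOUS_DATA_DT placeholder are assumptions carried over from above, and a matching partition filter is added on the source side so both scans are pruned:

```sql
-- Sketch: assumes both tables are partitioned on a 'dt'-style column and
-- that ${PREVIOUS_DATA_DT} is substituted with the last loaded date.
INSERT INTO TABLE target_temp PARTITION (DATA_DT)
SELECT ms.*
FROM temp_db.source_temp ms
JOIN (SELECT MAX(load_ts) AS max_load_ts
      FROM target_temp
      WHERE dt = '${PREVIOUS_DATA_DT}') mt
  ON 1 = 1
WHERE ms.dt >= '${PREVIOUS_DATA_DT}'   -- prune source partitions as well
  AND ms.load_ts > mt.max_load_ts;
```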

Otherwise, it's hard to help without knowing the structure of the source table and, like somebody else commented, the sizes of the two tables.

A JOIN is slower than a variable in the WHERE clause. But the main problem with performance here is that your query performs a full scan of both the target table and the source table. I would recommend:

  1. Query only the latest partition for max(load_ts).
  2. Enable statistics gathering and usage:

    set hive.compute.query.using.stats=true;
    set hive.stats.fetch.column.stats=true;
    set hive.stats.fetch.partition.stats=true;
    set hive.stats.autogather=true;

Compute column statistics on both tables. Statistics will make queries like SELECT MAX(partition) or MAX(ts) execute faster.

  3. If applicable, try to put the source partition files into the target partition folder instead of using INSERT (the partitioning and storage format of the target and source tables should enable this). It works fine, for example, for the TEXTFILE storage format when a source table partition contains only rows > max(target_partition). You can combine both the copy-files method (for those source partitions that contain exactly the rows to be inserted, with no filtering needed) and INSERT (for partitions containing mixed data that needs filtering).

  4. Hive may be merging your files during INSERT. This merge phase takes additional time and adds an additional stage job. Check the hive.merge.mapredfiles option and try switching it off.

  5. And of course, use a pre-calculated variable instead of a join.
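A minimal sketch of the pre-calculated-variable approach, using hivevar substitution (the variable name, script name, and shell wrapper are illustrative assumptions, not part of the original answer):

```sql
-- Step 1 (from the shell): compute the cutoff once and pass it in, e.g.
--   maxts=$(hive -S -e "SELECT MAX(load_ts) FROM target_temp")
--   hive --hivevar max_load_ts="$maxts" -f incremental_load.hql
-- Step 2 (incremental_load.hql): filter on the substituted literal,
-- so no join stage is needed at all.
INSERT INTO TABLE target_temp PARTITION (DATA_DT)
SELECT ms.*
FROM temp_db.source_temp ms
WHERE ms.load_ts > '${hivevar:max_load_ts}';
```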

Use the cost-based optimisation technique by enabling the properties below:

set hive.cbo.enable=true;
set hive.stats.autogather=true;
set hive.stats.fetch.column.stats=true;
set hive.compute.query.using.stats=true;
set hive.vectorized.execution.enabled=true;
set hive.exec.parallel=true;

Also analyze the tables:

ANALYZE TABLE temp_db.source_temp COMPUTE STATISTICS FOR COLUMNS [comma_separated_column_list];
ANALYZE TABLE target_temp PARTITION(DATA_DT) COMPUTE STATISTICS;
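For reference, the copy-files approach suggested earlier might look like this from the Hive CLI. The warehouse paths and the date are hypothetical, and it assumes both tables use the same TEXTFILE layout and DATA_DT partitioning:

```sql
-- Hypothetical paths and date; adjust to your actual warehouse locations.
dfs -cp /warehouse/temp_db.db/source_temp/data_dt=2020-01-02/*
        /warehouse/target_temp/data_dt=2020-01-02/;
-- Register the partition with the metastore if it does not exist yet:
ALTER TABLE target_temp ADD IF NOT EXISTS PARTITION (DATA_DT='2020-01-02');
```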
