How to tune Hive to query metadata?

Question

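If I run the Hive query below on a table with a partition column, I want to make sure Hive does not do a full table scan and instead works out the result from the metadata alone. Is there any way to enable this?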

Select max(partitioned_col) from hive_table ;

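Right now, when I run this query, it launches MapReduce tasks, and I am sure it is scanning the data, even though it could very well work out the value from the metadata itself.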

Answer 1

Compute table statistics every time the data changes.

ANALYZE TABLE hive_table PARTITION(partitioned_col) COMPUTE STATISTICS FOR COLUMNS;

Enable CBO and automatic statistics gathering:

set hive.cbo.enable=true;
set hive.stats.autogather=true;

Use the following settings to let CBO use the statistics:

set hive.compute.query.using.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.stats.fetch.column.stats=true;
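One way to check whether these settings take effect is to look at the query plan before running the query. The sketch below is only an illustration: it assumes the `hive` CLI is on the PATH and reuses the placeholder names `hive_table` and `partitioned_col` from the question; the function name `explain_max_partition` is made up here. If the printed plan still contains a map or reduce stage, the query is probably still scanning data.

```shell
#!/bin/sh
# Print the query plan with the stats-related settings from the answer applied.
# hive_table and partitioned_col are the placeholder names used in the question;
# explain_max_partition is a hypothetical helper name.
explain_max_partition() {
  hive -e "
    set hive.compute.query.using.stats=true;
    set hive.stats.fetch.partition.stats=true;
    set hive.stats.fetch.column.stats=true;
    EXPLAIN SELECT max(partitioned_col) FROM hive_table;
  "
}

# Only call Hive when the CLI is actually installed on this machine.
if command -v hive >/dev/null 2>&1; then
  explain_max_partition
else
  echo "hive CLI not found; skipping EXPLAIN" >&2
fi
```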

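If none of this helps, I'd recommend this approach for finding the last partition quickly: use a shell script that parses the maximum partition key from the table location. The command below prints all table folder paths, sorts them in reverse order, takes the first (latest) one, picks the partition subfolder name, and extracts the value after the equals sign. All you need to do is initialize the TABLE_DIR variable and fill in the position of the partition subfolder in the path: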

last_partition=$(hadoop fs -ls $TABLE_DIR/* | awk '{ print $8 }' | sort -r | head -n1 | cut -d / -f [number of partition subfolder in the path here] | cut -d = -f 2)

Then pass the $last_partition variable to your script like this:

  hive -hiveconf last_partition="$last_partition" -f your_script.hql
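To work out which field number to give `cut -f`, it may help to run the parsing part of the pipeline on sample paths first. The sketch below replaces the `hadoop fs -ls ... | awk '{ print $8 }'` stage with a hard-coded list; the paths and the `sample_paths` and `partition_field` variables are invented for illustration. Because the paths start with `/`, the empty string before the first slash counts as field 1.

```shell
#!/bin/sh
# Hypothetical paths, standing in for the output of
# `hadoop fs -ls $TABLE_DIR/* | awk '{ print $8 }'`.
sample_paths='/warehouse/db.db/hive_table/partitioned_col=2017-01-29/000000_0
/warehouse/db.db/hive_table/partitioned_col=2017-01-31/000000_0
/warehouse/db.db/hive_table/partitioned_col=2017-01-30/000000_0'

# Splitting on "/": field 1 is empty, 2 is "warehouse", 3 is "db.db",
# 4 is "hive_table", so the partition folder is field 5 for these paths.
partition_field=5

# Same parsing steps as the command in the answer: reverse sort, take the
# first line, pick the partition folder, keep the part after "=".
last_partition=$(printf '%s\n' "$sample_paths" \
  | sort -r \
  | head -n1 \
  | cut -d / -f "$partition_field" \
  | cut -d = -f 2)

echo "$last_partition"
```

Note that `sort -r` compares text, so this works for values such as zero-padded dates but can pick the wrong partition for unpadded numbers. If the listing prints full URIs instead of plain paths, the field number will differ.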

Answer 1 (4, accepted, 2017-01-31 07:54:49)
