Hive join optimization

Question

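I have two data sets, both stored in an S3 bucket, that I need to process in Hive and then write the output back to S3. A sample row from each data set looks like this: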

DataSet 1: {"requestId":"TADS6152JHGJH5435", "customerId":"ASJHAGSJH","sessionId":"172356126"}

DataSet2: {"requestId":"TADS6152JHGJH5435","userAgent":"Mozilla"}

I need to join these two data sets on requestId and output one combined row:

Output:  {"requestId":"TADS6152JHGJH5435", "customerId":"ASJHAGSJH","sessionId":"172356126","userAgent":"Mozilla"}

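The requestIds in data set 1 are a proper subset of the requestIds in data set 2. I am using a LEFT OUTER JOIN to get my output. Here is a simplified version of my Hive script: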

CREATE EXTERNAL TABLE dataset1 (
     requestId string,
     customerId string,
     sessionId string
 )
LOCATION 's3://path_to_dataset1/';

CREATE EXTERNAL TABLE dataset2 (
     requestId string,
     userAgent string
 )
LOCATION 's3://path_to_dataset2/';

CREATE EXTERNAL TABLE output (
     requestId string,
     customerId string,
     sessionId string,
     userAgent string
 )
LOCATION 's3://path_to_output/';

INSERT OVERWRITE TABLE output
  SELECT d1.requestId, d1.customerId, d1.sessionId, d2.userAgent
  FROM dataset1 d1 LEFT OUTER JOIN dataset2 d2
  ON (d1.requestId=d2.requestId);

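My questions are:

Is there any opportunity to optimize this join? Can I use partitioning or bucketing of the tables to make the join run faster? I have set hive.auto.convert.join to true in my script. What other Hive properties should I set to get better performance for the query above?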

Answer 1

1. Optimize Joins

We can improve join performance by enabling automatic conversion to map joins and by enabling the skew join optimization.

Auto Map Joins

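Auto map join is a very useful feature when joining a large table with a small one. When it is enabled, the small table is held in a local cache on every node and joined with the large table in the map phase. Enabling auto map join has two advantages. First, loading the small table into the cache saves read time on every data node. Second, it avoids skew in the join, because the join is already completed in the map phase for each block of data.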
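As a rough sketch for the query in the question, the session settings below control when Hive picks a map join. The threshold values are the commonly documented defaults, not tuned numbers, so treat them as a starting point. One caveat: in a LEFT OUTER JOIN the left table has to be streamed, so only the right-hand table (dataset2 here) can be the in-memory side. Since the question says every requestId in dataset1 also exists in dataset2, a plain JOIN should return the same rows and would let Hive cache dataset1 instead, assuming dataset1 is small enough to fit in memory.

```sql
-- Let Hive convert a common (shuffle) join into a map join when one side is small.
SET hive.auto.convert.join=true;
-- Size in bytes below which a table is treated as the "small" table.
SET hive.mapjoin.smalltable.filesize=25000000;
-- Convert straight to a map join, without a conditional backup task,
-- when the small side is below the size given on the next line (bytes).
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=10000000;

-- Inner join variant: equivalent to the LEFT OUTER JOIN only under the
-- question's assumption that dataset1.requestId is a subset of dataset2.requestId.
INSERT OVERWRITE TABLE output
  SELECT d1.requestId, d1.customerId, d1.sessionId, d2.userAgent
  FROM dataset1 d1 JOIN dataset2 d2
  ON (d1.requestId = d2.requestId);
```

Running EXPLAIN on the SELECT should show a Map Join Operator in the plan if the conversion took place.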

Skew Joins

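We can enable optimization of skewed (that is, imbalanced) joins by setting the hive.optimize.skewjoin property to true, either with a SET command in the Hive shell or in the hive-site.xml file.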

  <property>
    <name>hive.optimize.skewjoin</name>
    <value>true</value>
    <description>
      Whether to enable skew join optimization. 
      The algorithm is as follows: At runtime, detect the keys with a large skew. Instead of
      processing those keys, store them temporarily in an HDFS directory. In a follow-up map-reduce
      job, process those skewed keys. The same key need not be skewed for all the tables, and so,
      the follow-up map-reduce job (for the skewed keys) would be much faster, since it would be a
      map-join.
    </description>
  </property>
  <property>
    <name>hive.skewjoin.key</name>
    <value>100000</value>
    <description>
      Determine if we get a skew key in join. If we see more than the specified number of rows with the same key in join operator,
      we think the key as a skew join key. 
    </description>
  </property>
  <property>
    <name>hive.skewjoin.mapjoin.map.tasks</name>
    <value>10000</value>
    <description>
      Determine the number of map task used in the follow up map join job for a skew join.
      It should be used together with hive.skewjoin.mapjoin.min.split to perform a fine grained control.
    </description>
  </property>
  <property>
    <name>hive.skewjoin.mapjoin.min.split</name>
    <value>33554432</value>
    <description>
      Determine the number of map task at most used in the follow up map join job for a skew join by specifying 
      the minimum split size. It should be used together with hive.skewjoin.mapjoin.map.tasks to perform a fine grained control.
    </description>
  </property>
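If you would rather try this per session than edit hive-site.xml, the same four properties can be set from the Hive shell. The values here simply mirror the XML above. Note that the requestId values in the question look close to unique, so it is an open question whether this particular query has any skewed keys to benefit from it.

```sql
-- Session-level equivalent of the hive-site.xml fragment above.
SET hive.optimize.skewjoin=true;
-- A key with more rows than this is treated as skewed.
SET hive.skewjoin.key=100000;
-- Controls for the follow-up map join job that handles the skewed keys.
SET hive.skewjoin.mapjoin.map.tasks=10000;
SET hive.skewjoin.mapjoin.min.split=33554432;
```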

2. Enable Bucketed Map Joins

If the tables are bucketed by a particular column and those tables are used in a join, we can enable bucketed map joins to improve performance.

  <property>
    <name>hive.optimize.bucketmapjoin</name>
    <value>true</value>
    <description>Whether to try bucket mapjoin</description>
  </property>
  <property>
    <name>hive.optimize.bucketmapjoin.sortedmerge</name>
    <value>true</value>
    <description>Whether to try sorted bucket merge map join</description>
  </property>
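For the tables in the question, bucketing would mean rewriting both data sets into tables clustered on requestId. The following is only a sketch: the table names dataset1_bucketed and dataset2_bucketed, the bucket count of 32, and the ORC storage format are my assumptions, not something from the question. Both tables need bucket counts that are equal or multiples of each other for the bucketed join to apply.

```sql
-- Only needed on Hive versions before 2.0, where inserts do not honor the
-- bucket definition by default. Drop these two lines on later versions.
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

-- Hypothetical bucketed copies of the two input tables.
CREATE TABLE dataset1_bucketed (
     requestId string,
     customerId string,
     sessionId string
 )
CLUSTERED BY (requestId) SORTED BY (requestId) INTO 32 BUCKETS
STORED AS ORC;

CREATE TABLE dataset2_bucketed (
     requestId string,
     userAgent string
 )
CLUSTERED BY (requestId) SORTED BY (requestId) INTO 32 BUCKETS
STORED AS ORC;

INSERT OVERWRITE TABLE dataset1_bucketed
  SELECT requestId, customerId, sessionId FROM dataset1;

INSERT OVERWRITE TABLE dataset2_bucketed
  SELECT requestId, userAgent FROM dataset2;

-- Allow bucketed and sort-merge bucketed joins for this session.
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.auto.convert.sortmerge.join=true;

INSERT OVERWRITE TABLE output
  SELECT d1.requestId, d1.customerId, d1.sessionId, d2.userAgent
  FROM dataset1_bucketed d1 LEFT OUTER JOIN dataset2_bucketed d2
  ON (d1.requestId = d2.requestId);
```

Keep in mind that populating the bucketed tables is itself a full pass over the data, so this probably only pays off if the same data sets are joined more than once.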


3. Enable Tez Execution Engine

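Instead of running Hive queries on the venerable MapReduce engine, we can improve their performance by at least 100% to 300% by running them on the Tez execution engine. We can enable the Tez engine from the Hive shell with the following property.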

hive> set hive.execution.engine=tez;


4. Enable Parallel Execution

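Hive converts a query into one or more stages. A stage can be a MapReduce stage, a sampling stage, a merge stage, or a limit stage. By default, Hive executes these stages one at a time. A particular job may consist of stages that do not depend on each other and could run in parallel, which may allow the overall job to finish sooner. Parallel execution can be enabled by setting the properties below.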

  <property>
    <name>hive.exec.parallel</name>
    <value>true</value>
    <description>Whether to execute jobs in parallel</description>
  </property>
  <property>
    <name>hive.exec.parallel.thread.number</name>
    <value>8</value>
    <description>How many jobs at most can be executed in parallel</description>
  </property>


5. Enable Vectorization

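Vectorization was first introduced to Hive in the 0.13 release line. With vectorized query execution, operations such as scans, aggregations, filters, and joins process batches of 1024 rows at a time instead of a single row, which improves their performance.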

We can enable vectorized query execution by setting the following three properties, either in the Hive shell or in the hive-site.xml file.

hive> set hive.vectorized.execution.enabled = true;
hive> set hive.vectorized.execution.reduce.enabled = true;
hive> set hive.vectorized.execution.reduce.groupby.enabled = true;
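One thing to be aware of: vectorized execution has historically required the data to be stored in ORC format, so it may have no effect on the raw files behind the external tables in the question. A minimal sketch of staging the inputs as ORC first is shown below. The table names dataset1_orc and dataset2_orc are made up for illustration.

```sql
SET hive.vectorized.execution.enabled=true;

-- Hypothetical ORC copies of the two input tables.
CREATE TABLE dataset1_orc STORED AS ORC AS
  SELECT requestId, customerId, sessionId FROM dataset1;

CREATE TABLE dataset2_orc STORED AS ORC AS
  SELECT requestId, userAgent FROM dataset2;

-- Inspect the plan before running the real insert.
EXPLAIN
  SELECT d1.requestId, d1.customerId, d1.sessionId, d2.userAgent
  FROM dataset1_orc d1 LEFT OUTER JOIN dataset2_orc d2
  ON (d1.requestId = d2.requestId);
```

If vectorization is in effect, the EXPLAIN output should mark the relevant tasks with "Execution mode: vectorized".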


6. Enable Cost Based Optimization

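Recent Hive releases provide cost-based optimization, which optimizes further based on query cost and can lead to different decisions: how to order joins, which type of join to perform, the degree of parallelism, and so on.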

Cost-based optimization can be enabled by setting the properties below in the hive-site.xml file.

  <property>
    <name>hive.cbo.enable</name>
    <value>true</value>
    <description>Flag to control enabling Cost Based Optimizations using Calcite framework.</description>
  </property>
  <property>
    <name>hive.compute.query.using.stats</name>
    <value>true</value>
    <description>
      When set to true Hive will answer a few queries like count(1) purely using stats
      stored in metastore. For basic stats collection turn on the config hive.stats.autogather to true.
      For more advanced stats collection need to run analyze table queries.
    </description>
  </property>
  <property>
    <name>hive.stats.fetch.partition.stats</name>
    <value>true</value>
    <description>
      Annotation of operator tree with statistics information requires partition level basic
      statistics like number of rows, data size and file size. Partition statistics are fetched from
      metastore. Fetching partition statistics for each needed partition can be expensive when the
      number of partitions is high. This flag can be used to disable fetching of partition statistics
      from metastore. When this flag is disabled, Hive will make calls to filesystem to get file sizes
      and will estimate the number of rows from row schema.
    </description>
  </property>
  <property>
    <name>hive.stats.fetch.column.stats</name>
    <value>true</value>
    <description>
      Annotation of operator tree with statistics information requires column statistics.
      Column statistics are fetched from metastore. Fetching column statistics for each needed column
      can be expensive when the number of columns is high. This flag can be used to disable fetching
      of column statistics from metastore.
    </description>
  </property>
  <property>
    <name>hive.stats.autogather</name>
    <value>true</value>
    <description>A flag to gather statistics automatically during the INSERT OVERWRITE command.</description>
  </property>
  <property>
    <name>hive.stats.dbclass</name>
    <value>fs</value>
    <description>
      Expects one of the pattern in [jdbc(:.*), hbase, counter, custom, fs].
      The storage that stores temporary Hive statistics. In filesystem based statistics collection ('fs'), 
      each task writes statistics it has collected in a file on the filesystem, which will be aggregated 
      after the job has finished. Supported values are fs (filesystem), jdbc:database (where database 
      can be derby, mysql, etc.), hbase, counter, and custom as defined in StatsSetupConst.java.
    </description>
  </property>
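The optimizer can only use statistics that exist. As the description of hive.compute.query.using.stats notes, more advanced statistics need ANALYZE TABLE queries, and hive.stats.autogather only covers data written through Hive. For the external tables in the question, whose files were put in S3 by something else, a sketch of gathering them by hand could look like this:

```sql
-- Table-level statistics (row count, data size).
ANALYZE TABLE dataset1 COMPUTE STATISTICS;
ANALYZE TABLE dataset2 COMPUTE STATISTICS;

-- Column-level statistics for the join key and the selected columns.
ANALYZE TABLE dataset1 COMPUTE STATISTICS FOR COLUMNS requestId, customerId, sessionId;
ANALYZE TABLE dataset2 COMPUTE STATISTICS FOR COLUMNS requestId, userAgent;

-- Check what was recorded (look for numRows and totalSize in the table parameters).
DESCRIBE FORMATTED dataset1;
```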

Solution 1
16 2015-09-03 10:26:22
