Hive Join Optimization

Question

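I have two datasets, both stored in an S3 bucket, that I need to process in Hive, writing the output back to S3. A sample row from each dataset is shown below: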

DataSet 1: {"requestId":"TADS6152JHGJH5435", "customerId":"ASJHAGSJH","sessionId":"172356126"}

DataSet 2: {"requestId":"TADS6152JHGJH5435","userAgent":"Mozilla"}

I need to join these two datasets on requestId and output a combined row:

Output:  {"requestId":"TADS6152JHGJH5435", "customerId":"ASJHAGSJH","sessionId":"172356126","userAgent":"Mozilla"}

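The requestIds in dataset 1 are a proper subset of the requestIds in dataset 2. I am using a LEFT OUTER JOIN to get the output. Here is a simplified version of my Hive script: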

CREATE EXTERNAL TABLE dataset1 (
     requestId string,
     customerId string,
     sessionId string
 )
LOCATION 's3://path_to_dataset1/';

CREATE EXTERNAL TABLE dataset2 (
     requestId string,
     userAgent string
 )
LOCATION 's3://path_to_dataset2/';

CREATE EXTERNAL TABLE output (
     requestId string,
     customerId string,
     sessionId string,
     userAgent string
 )
LOCATION 's3://path_to_output/';

INSERT OVERWRITE TABLE output
  SELECT d1.requestId, d1.customerId, d1.sessionId, d2.userAgent
  FROM dataset1 d1 LEFT OUTER JOIN dataset2 d2
  ON (d1.requestId=d2.requestId);

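My questions are:

Is there any opportunity to optimize this join? Can I use partitioning or bucketing of the tables to make the join run faster? I have set hive.auto.convert.join to true in my script. What other Hive properties should I set to get better performance for the above query?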

Answer 1

1. Optimize Joins

We can improve the performance of joins by enabling automatic conversion to map joins and by enabling the skew join optimization.

Auto Map Joins

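Auto Map-Join is a very useful feature when joining a large table with a small table. With this feature enabled, the small table is held in a local cache on each node and is joined with the large table in the map phase. Enabling auto map joins has two advantages. First, loading the small table into the cache saves read time on each data node. Second, it avoids the skew problem in the Hive query: because the join is completed in the map phase for each block of data, no rows are shuffled to reducers by join key.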
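
For the query in the question, a minimal sketch of the session settings might look like the following. The threshold values are what I understand to be the stock defaults, so treat them as assumptions. The rewrite to an inner join is also an assumption on my part: for a LEFT OUTER JOIN, Hive can only hold the right-hand table in memory, so with dataset1 on the left the smaller table cannot be the cached one. Since every requestId in dataset1 is stated to exist in dataset2, an inner join should return the same rows, provided requestId is never NULL in dataset1.

```sql
-- Let Hive convert a common join into a map join when one side is small enough.
SET hive.auto.convert.join=true;
-- Convert directly to a map join, without a conditional backup task,
-- when the small side fits under the size limit below.
SET hive.auto.convert.join.noconditionaltask=true;
-- Combined size in bytes of the small tables that may be held in memory
-- (10 MB; assumed default).
SET hive.auto.convert.join.noconditionaltask.size=10000000;
-- Size in bytes below which a table counts as "small" (about 25 MB; assumed default).
SET hive.mapjoin.smalltable.filesize=25000000;

-- Inner join instead of LEFT OUTER JOIN, so that either side may be cached.
INSERT OVERWRITE TABLE output
SELECT d1.requestId, d1.customerId, d1.sessionId, d2.userAgent
FROM dataset1 d1 JOIN dataset2 d2
  ON (d1.requestId = d2.requestId);
```

Whether dataset1 actually fits under these thresholds depends on its size on S3. Raising the thresholds trades memory in each map task for avoiding the shuffle.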

Skew Joins

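We can enable the optimization of skew joins, i.e. imbalanced joins, by setting the hive.optimize.skewjoin property to true, either with the SET command in the hive shell or in the hive-site.xml file.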

  <property>
    <name>hive.optimize.skewjoin</name>
    <value>true</value>
    <description>
      Whether to enable skew join optimization. 
      The algorithm is as follows: At runtime, detect the keys with a large skew. Instead of
      processing those keys, store them temporarily in an HDFS directory. In a follow-up map-reduce
      job, process those skewed keys. The same key need not be skewed for all the tables, and so,
      the follow-up map-reduce job (for the skewed keys) would be much faster, since it would be a
      map-join.
    </description>
  </property>
  <property>
    <name>hive.skewjoin.key</name>
    <value>100000</value>
    <description>
      Determine if we get a skew key in join. If we see more than the specified number of rows with the same key in join operator,
      we think the key as a skew join key. 
    </description>
  </property>
  <property>
    <name>hive.skewjoin.mapjoin.map.tasks</name>
    <value>10000</value>
    <description>
      Determine the number of map task used in the follow up map join job for a skew join.
      It should be used together with hive.skewjoin.mapjoin.min.split to perform a fine grained control.
    </description>
  </property>
  <property>
    <name>hive.skewjoin.mapjoin.min.split</name>
    <value>33554432</value>
    <description>
      Determine the number of map task at most used in the follow up map join job for a skew join by specifying 
      the minimum split size. It should be used together with hive.skewjoin.mapjoin.map.tasks to perform a fine grained control.
    </description>
  </property>
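
The same skew join settings can also be applied per session from the hive shell rather than in hive-site.xml. This is only a sketch that reuses the property names and values from the XML above. Note that hive.skewjoin.key is a row count per join key, so if requestId values are close to unique, as the sample rows suggest, no key would reach the threshold and the optimization would have nothing to do.

```sql
-- Enable runtime detection and separate handling of skewed join keys.
SET hive.optimize.skewjoin=true;
-- A key is treated as skewed once more than this many rows share it.
SET hive.skewjoin.key=100000;
-- Controls for the follow-up map join job that processes the skewed keys.
SET hive.skewjoin.mapjoin.map.tasks=10000;
SET hive.skewjoin.mapjoin.min.split=33554432;
```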

2. Enable Bucketed Map Joins

If the tables are bucketed by a particular column and that column is used as the join key, we can enable bucketed map joins to improve performance.

  <property>
    <name>hive.optimize.bucketmapjoin</name>
    <value>true</value>
    <description>Whether to try bucket mapjoin</description>
  </property>
  <property>
    <name>hive.optimize.bucketmapjoin.sortedmerge</name>
    <value>true</value>
    <description>Whether to try sorted bucket merge map join</description>
  </property>
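
The external JSON tables in the question are not bucketed, so these properties have no effect on them as they stand. A rough sketch of one way to prepare bucketed copies follows. The table names dataset1_bucketed and dataset2_bucketed and the bucket count of 32 are hypothetical, chosen only for illustration. A bucket map join needs both tables bucketed on the join key, with one bucket count a multiple of the other, which identical counts satisfy.

```sql
-- Only for Hive releases before 2.0; drop these two lines on later releases,
-- where bucketing is always enforced and the properties no longer exist.
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

-- Hypothetical bucketed, sorted copies of the question's tables.
CREATE TABLE dataset1_bucketed (
  requestId string,
  customerId string,
  sessionId string
)
CLUSTERED BY (requestId) SORTED BY (requestId) INTO 32 BUCKETS;

CREATE TABLE dataset2_bucketed (
  requestId string,
  userAgent string
)
CLUSTERED BY (requestId) SORTED BY (requestId) INTO 32 BUCKETS;

INSERT OVERWRITE TABLE dataset1_bucketed
SELECT requestId, customerId, sessionId FROM dataset1;

INSERT OVERWRITE TABLE dataset2_bucketed
SELECT requestId, userAgent FROM dataset2;

-- Session-level equivalents of the XML properties above.
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
```

Populating the bucketed copies costs a full pass over both datasets, so this is only likely to pay off if the same data is joined more than once.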

3. Enable Tez Execution Engine

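Instead of running Hive queries on the venerable MapReduce engine, we can run them on the Tez execution engine. Improvements of 100% to 300% are claimed, though the actual speedup depends on the query and the cluster. We can enable the Tez engine from the hive shell with the following property.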

hive> set hive.execution.engine=tez;

4. Enable Parallel Execution

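Hive converts a query into one or more stages. A stage can be a MapReduce stage, a sampling stage, a merge stage, or a limit stage. By default, Hive executes these stages one at a time. A particular job may contain stages that do not depend on each other and can be executed in parallel, possibly allowing the overall job to complete more quickly. Parallel execution can be enabled by setting the following properties.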

  <property>
    <name>hive.exec.parallel</name>
    <value>true</value>
    <description>Whether to execute jobs in parallel</description>
  </property>
  <property>
    <name>hive.exec.parallel.thread.number</name>
    <value>8</value>
    <description>How many jobs at most can be executed in parallel</description>
  </property>

5. Enable Vectorization

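Vectorization was first introduced in Hive in the 0.13.0 release. With vectorized query execution, operations such as scans, aggregations, filters, and joins process rows in batches of 1024 rather than one row at a time, which improves their performance.

We can enable vectorized query execution by setting the following three properties in the hive shell or in the hive-site.xml file.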

hive> set hive.vectorized.execution.enabled = true;
hive> set hive.vectorized.execution.reduce.enabled = true;
hive> set hive.vectorized.execution.reduce.groupby.enabled = true;
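
In Hive releases of that period, vectorized execution only applies to tables stored as ORC, so it would not affect the JSON text tables in the question directly. A sketch of one way to get there, assuming hypothetical ORC copies named dataset1_orc and dataset2_orc:

```sql
-- Hypothetical ORC copies of the question's tables, created with CTAS.
CREATE TABLE dataset1_orc STORED AS ORC AS
SELECT requestId, customerId, sessionId FROM dataset1;

CREATE TABLE dataset2_orc STORED AS ORC AS
SELECT requestId, userAgent FROM dataset2;

SET hive.vectorized.execution.enabled=true;

-- Inspect the plan to check whether the scan and join are vectorized.
EXPLAIN
SELECT d1.requestId, d1.customerId, d1.sessionId, d2.userAgent
FROM dataset1_orc d1 LEFT OUTER JOIN dataset2_orc d2
  ON (d1.requestId = d2.requestId);
```

The conversion itself reads every row once, so for a single one-off join it may cost more than vectorization saves.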

6. Enable Cost Based Optimization

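Recent Hive releases provide cost-based optimization, which allows further optimization based on query cost and can lead to different decisions: how to order joins, which type of join to perform, the degree of parallelism, and so on.

Cost-based optimization can be enabled by setting the following properties in the hive-site.xml file.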

  <property>
    <name>hive.cbo.enable</name>
    <value>true</value>
    <description>Flag to control enabling Cost Based Optimizations using Calcite framework.</description>
  </property>
  <property>
    <name>hive.compute.query.using.stats</name>
    <value>true</value>
    <description>
      When set to true Hive will answer a few queries like count(1) purely using stats
      stored in metastore. For basic stats collection turn on the config hive.stats.autogather to true.
      For more advanced stats collection need to run analyze table queries.
    </description>
  </property>
  <property>
    <name>hive.stats.fetch.partition.stats</name>
    <value>true</value>
    <description>
      Annotation of operator tree with statistics information requires partition level basic
      statistics like number of rows, data size and file size. Partition statistics are fetched from
      metastore. Fetching partition statistics for each needed partition can be expensive when the
      number of partitions is high. This flag can be used to disable fetching of partition statistics
      from metastore. When this flag is disabled, Hive will make calls to filesystem to get file sizes
      and will estimate the number of rows from row schema.
    </description>
  </property>
  <property>
    <name>hive.stats.fetch.column.stats</name>
    <value>true</value>
    <description>
      Annotation of operator tree with statistics information requires column statistics.
      Column statistics are fetched from metastore. Fetching column statistics for each needed column
      can be expensive when the number of columns is high. This flag can be used to disable fetching
      of column statistics from metastore.
    </description>
  </property>
  <property>
    <name>hive.stats.autogather</name>
    <value>true</value>
    <description>A flag to gather statistics automatically during the INSERT OVERWRITE command.</description>
  </property>
  <property>
    <name>hive.stats.dbclass</name>
    <value>fs</value>
    <description>
      Expects one of the pattern in [jdbc(:.*), hbase, counter, custom, fs].
      The storage that stores temporary Hive statistics. In filesystem based statistics collection ('fs'), 
      each task writes statistics it has collected in a file on the filesystem, which will be aggregated 
      after the job has finished. Supported values are fs (filesystem), jdbc:database (where database 
      can be derby, mysql, etc.), hbase, counter, and custom as defined in StatsSetupConst.java.
    </description>
  </property>
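
The cost-based optimizer can only make use of statistics that exist in the metastore. As the description of hive.compute.query.using.stats notes, more advanced statistics require analyze table queries. A minimal sketch for the tables in the question, using their column names as given:

```sql
-- Basic table statistics (row count, data size).
ANALYZE TABLE dataset1 COMPUTE STATISTICS;
ANALYZE TABLE dataset2 COMPUTE STATISTICS;

-- Column statistics, which the optimizer uses when choosing a join strategy.
ANALYZE TABLE dataset1 COMPUTE STATISTICS FOR COLUMNS requestId, customerId, sessionId;
ANALYZE TABLE dataset2 COMPUTE STATISTICS FOR COLUMNS requestId, userAgent;

-- Check which join strategy the optimizer picks once statistics are available.
EXPLAIN
SELECT d1.requestId, d1.customerId, d1.sessionId, d2.userAgent
FROM dataset1 d1 LEFT OUTER JOIN dataset2 d2
  ON (d1.requestId = d2.requestId);
```

Each ANALYZE statement scans the table, so on large S3-backed datasets it is worth running them only after the data has changed.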

Solution 1
16 2015-09-03 10:26:22
