
Performance tuning a Hive query

I have a Hive query which selects about 30 columns and around 400,000 records and inserts them into another table. My SQL has one join, which is just an inner join.

The query fails with a Java "GC overhead limit exceeded" error.

What's strange is that if I remove the join clause and just select the data from the table (a slightly higher volume), the query works fine.

I'm pretty new to Hive, and I can't understand why this join is causing memory exceptions.

Is there something I should be aware of in how I write Hive queries so that they don't cause these issues? Could anyone explain why the join might cause this problem while selecting a higher volume of data with the same number of columns does not?

Appreciate your thoughts on this. Thanks.

Depending on the version of Hive and your configuration, the answer to your question may vary. It would be easier if you could share your exact query along with the CREATE statements of the two tables and an estimate of their sizes.

To get a better understanding of the problem, let's go through how a "regular" inner join works in Hive.

Hive join in MapReduce:

Here is a simplified description of how an inner join in Hive gets compiled to MapReduce. In general, if you have two tables t1 and t2 with a join query like:

SELECT
   t1.key, t1.value, t2.value
FROM
   t1
   JOIN
   t2 ON (t1.key = t2.key);

where t1 has the following contents:

k_1    v1_1
k_2    v1_2
k_3    v1_3    

and t2 has the following contents:

k_2    v2_2
k_3    v2_3
k_4    v2_4    

We would expect the join result to be:

k_2    v1_2    v2_2
k_3    v1_3    v2_3

Assuming the tables are stored on HDFS, their contents will be split up into file splits. A mapper takes a file split as input and emits, for each record, the join key column of the table as the key, and as the value a composite of the table's value column and a flag indicating which table the record came from (i.e. t1 or t2).

For t1:

k_1, <v1_1, t1>
k_2, <v1_2, t1>
k_3, <v1_3, t1>

For t2:

k_2, <v2_2, t2>
k_3, <v2_3, t2>
k_4, <v2_4, t2>
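As a rough sketch (plain Python, not Hive's actual code), the tagged emit of the map phase can be simulated like this, using the same t1/t2 example data as above:

```python
# Toy simulation of the map phase of a Hive reduce-side join.
# Each record is emitted as (key, (value, table_tag)); the tag is what
# later lets the reducer tell the two sides of the join apart.

t1 = [("k_1", "v1_1"), ("k_2", "v1_2"), ("k_3", "v1_3")]
t2 = [("k_2", "v2_2"), ("k_3", "v2_3"), ("k_4", "v2_4")]

def map_phase(table, tag):
    # Emit (key, (value, table_tag)) for every record in the table.
    return [(key, (value, tag)) for key, value in table]

emitted = map_phase(t1, "t1") + map_phase(t2, "t2")
for key, tagged_value in emitted:
    print(key, tagged_value)
```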

Now, these emitted records go through the shuffle phase, where all records with the same key are grouped together and sent to a reducer. The context of each reduce operation is one key and a list containing all the values corresponding to that key. In practice, one reducer will perform several reduce operations.

In the above example, we would get the following groupings:

k_1, <<v1_1, t1>>
k_2, <<v1_2, t1>, <v2_2, t2>>
k_3, <<v1_3, t1>, <v2_3, t2>>
k_4, <<v2_4, t2>>
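The shuffle groupings can be reproduced with a simple group-by over the tagged records (again a toy sketch, not Hive internals):

```python
# Toy simulation of the shuffle phase: tagged (key, (value, table))
# records from the mappers are grouped by key, which is the shape in
# which a reducer receives them.
from collections import defaultdict

# Tagged records, as emitted by the map phase in the example above.
emitted = [
    ("k_1", ("v1_1", "t1")), ("k_2", ("v1_2", "t1")), ("k_3", ("v1_3", "t1")),
    ("k_2", ("v2_2", "t2")), ("k_3", ("v2_3", "t2")), ("k_4", ("v2_4", "t2")),
]

def shuffle(records):
    groups = defaultdict(list)
    for key, tagged_value in records:
        groups[key].append(tagged_value)
    return dict(groups)

groups = shuffle(emitted)
```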

Here is what happens in the reducer. For each key's list of values, the reducer emits the cross product of the values that correspond to different tables.

For k_1, there is no value from t2 and nothing is emitted.

For k_2, a cross product of values is emitted - k_2, v1_2, v2_2 (since there is one value from each table, 1x1 = 1 row).

For k_3, a cross product of values is emitted - k_3, v1_3, v2_3 (since there is one value from each table, 1x1 = 1 row).

For k_4, there is no value from t1 and nothing is emitted. Hence you obtain the result that you expected from your inner join.
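Putting the three phases together, here is an end-to-end toy simulation of the reduce-side inner join (a sketch of the idea, not Hive's implementation). Note how the cross product per key is where skew hurts: a hot key with m values on one side and n on the other emits m×n rows from a single reduce call.

```python
# End-to-end toy simulation of a reduce-side inner join:
# tag records by table, group by key (shuffle), then emit the
# cross product of the two sides' values in the reducer.
from collections import defaultdict
from itertools import product

t1 = [("k_1", "v1_1"), ("k_2", "v1_2"), ("k_3", "v1_3")]
t2 = [("k_2", "v2_2"), ("k_3", "v2_3"), ("k_4", "v2_4")]

def reduce_side_join(left, right):
    # Group values by key, keeping the two tables' values separate
    # (this plays the role of the table flag in the shuffled records).
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    result = []
    for key in sorted(groups):
        left_vals, right_vals = groups[key]
        # Inner join: keys present in only one table have an empty
        # list on the other side, so the product is empty and
        # nothing is emitted for them.
        for v1, v2 in product(left_vals, right_vals):
            result.append((key, v1, v2))
    return result

print(reduce_side_join(t1, t2))
# expected: [('k_2', 'v1_2', 'v2_2'), ('k_3', 'v1_3', 'v2_3')]
```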

Ok, so what do I do?

  1. It's possible that there is skew in your data. In other words, when the reducer gets the data, the list of values corresponding to some key is very long, which causes an error. To alleviate the problem, you may try bumping up the memory available to your JVM. You can do so by setting mapred.child.java.opts to a value like -Xmx512M in your hive-site.xml. You can query the present value of this parameter by running set mapred.child.java.opts; in your Hive shell.

  2. You can try using alternatives to the "regular" join, e.g. a map join. The above explanation of joins applies to regular joins, where the joining happens in the reducers. Depending on the version of Hive you are using, Hive may automatically be able to convert a regular join to a map join, which is faster (because the join happens in the map phase). To enable the optimization, set hive.auto.convert.join to true. This property was introduced in Hive 0.7.

  3. In addition to setting hive.auto.convert.join to true, you may also set hive.optimize.skewjoin to true. This works around the data-skew problem described in 1.
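The map join mentioned in 2 can be sketched as follows (a toy Python illustration of the broadcast-hash-join idea, not Hive's actual code): the small table is loaded into an in-memory hash table available to every mapper, so each big-table record is joined as it is read, with no shuffle or reduce phase at all.

```python
# Toy sketch of a map join (broadcast hash join): build a hash table
# from the small table once, then stream the big table and probe it
# per record. No shuffle, no reducer.
t1_small = [("k_1", "v1_1"), ("k_2", "v1_2"), ("k_3", "v1_3")]
t2_big   = [("k_2", "v2_2"), ("k_3", "v2_3"), ("k_4", "v2_4")]

def map_join(small, big):
    # Build the in-memory lookup from the small table.
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    # Stream the big table; each record joins immediately in the "map".
    result = []
    for key, big_value in big:
        for small_value in lookup.get(key, []):
            result.append((key, small_value, big_value))
    return result

print(map_join(t1_small, t2_big))
# expected: [('k_2', 'v1_2', 'v2_2'), ('k_3', 'v1_3', 'v2_3')]
```

This is also why map joins only help when one side of the join is small enough to fit in memory.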

Many thanks for the response Mark. Much appreciated.

After many hours I eventually found out that the order of tables in the join statement makes a difference. For optimum performance and memory management, the last table in the join should be the largest one.

Changing the order of my tables in the join statement fixed the issue.

See "Largest Table Last" at http://hive.apache.org/docs/r0.9.0/language_manual/joins.html

Your explanation above is very useful as well. Many thanks.
