
Comparing efficiency of Hive queries with different join orders

Consider the following two queries in Hive:

SELECT
    *
FROM
    A
INNER JOIN 
    B
INNER JOIN
    C
ON 
    A.COL = B.COL
AND A.COL = C.COL

and

SELECT
    *
FROM
    A
INNER JOIN
    B
ON
    A.COL = B.COL
INNER JOIN
    C
ON
    A.COL = C.COL

Question: Are the two queries computationally the same or different? In other words, to get the fastest results, should I prefer one form over the other, or does it not matter? Thanks.

On Hive 1.2 (also tested on Hive 2.3), both on Tez, the optimizer is smart enough to derive the ON condition for the join with table B, and it performs two INNER JOINs, each with its own correct ON condition.

I checked this with a simple query:

with A as (
select stack(3,1,2,3) as id
),
B as (
select stack(3,1,2,3) as id
),
C as (
select stack(3,1,2,3) as id
)

select * from A 
inner join B
inner join C
ON A.id = B.id AND A.id = C.id

The EXPLAIN command shows that both joins are executed as a map-join on a single mapper, and each join has its own join condition. This is the EXPLAIN output:

Map 1
  File Output Operator [FS_17]
    Map Join Operator [MAPJOIN_27] (rows=1 width=12)
      Conds: FIL_24.col0=RS_12.col0(Inner), FIL_24.col0=RS_14.col0(Inner), HybridGraceHashJoin:true, Output:["_col0","_col1","_col2"]
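To compare the plans yourself, you can run EXPLAIN on the second form of the query as well. The following is a minimal sketch that reuses the same three CTEs from the test query above and only moves each ON clause next to its own join; the operator names and row counts in your output may differ from the ones shown here.

```sql
-- Sketch: EXPLAIN for the "one ON per join" form, using the same test CTEs.
-- Compare its Map Join Operator and Conds line with the plan of the chained form.
EXPLAIN
WITH A AS (
  SELECT stack(3,1,2,3) AS id
),
B AS (
  SELECT stack(3,1,2,3) AS id
),
C AS (
  SELECT stack(3,1,2,3) AS id
)
SELECT *
FROM A
INNER JOIN B
  ON A.id = B.id
INNER JOIN C
  ON A.id = C.id;
```

If both forms show the same join conditions in their plans, as they did in the tests described in this answer, the optimizer is treating them identically.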

At first I thought the first query would do a CROSS join with table B, and that the subsequent join with C would then reduce the dataset. But both queries work the same way (same plan, same execution), thanks to the optimizer.

I also ran the same test with map-join switched off (set hive.auto.convert.join=false;) and again got exactly the same plan for both queries. I did not test this on really big tables, so you had better double-check.
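The map-join-off check can be reproduced along these lines. This is only a sketch: it assumes the setting is applied in the same session before EXPLAIN, and it uses the chained form of the test query; repeat it with the other form and compare the two plans side by side.

```sql
-- Sketch: disable automatic map-join conversion for this session,
-- so the joins are planned as reduce-side (merge) joins instead.
SET hive.auto.convert.join=false;

-- EXPLAIN for the chained form with a single trailing ON clause.
EXPLAIN
WITH A AS (
  SELECT stack(3,1,2,3) AS id
),
B AS (
  SELECT stack(3,1,2,3) AS id
),
C AS (
  SELECT stack(3,1,2,3) AS id
)
SELECT *
FROM A
INNER JOIN B
INNER JOIN C
  ON A.id = B.id AND A.id = C.id;
```

For real tables, replace the CTEs with your own table names and check that neither plan contains an unexpected cross product before relying on either form.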

So, computationally, both queries are the same on Hive 1.2 and Hive 2.3, for both map-join and merge join on the reducer.
