Comparing the efficiency of Hive queries with different join order

Question

Consider the following two queries in Hive:

SELECT
    *
FROM
    A
INNER JOIN 
    B
INNER JOIN
    C
ON 
    A.COL = B.COL
AND A.COL = C.COL

and

SELECT
    *
FROM
    A
INNER JOIN
    B
ON
    A.COL = B.COL
INNER JOIN
    C
ON
    A.COL = C.COL

Question: Are the two queries computationally the same or different? In other words, to get the fastest results, should I prefer one form over the other, or does it not matter? Thanks.

Answer 1

On Hive 1.2 (also tested on Hive 2.3), both running on Tez, the optimizer is intelligent enough to derive the ON condition for the join with table B, and it performs two INNER JOINs, each with its own correct ON condition.

I checked this with a simple query:

with A as (
select stack(3,1,2,3) as id
),
B as (
select stack(3,1,2,3) as id
),
C as (
select stack(3,1,2,3) as id
)

select * from A 
inner join B
inner join C
ON A.id = B.id AND A.id = C.id

The EXPLAIN command shows that both joins are executed as a map-join on a single mapper, and each join has its own join condition. This is the EXPLAIN output:

Map 1
  File Output Operator [FS_17]
    Map Join Operator [MAPJOIN_27] (rows=1 width=12)
      Conds:FIL_24.col0=RS_12.col0(Inner),FIL_24.col0=RS_14.col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1","_col2"]
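To reproduce the comparison, the second form of the query (one ON clause per join) can be run through EXPLAIN against the same test data. This is only a sketch that reuses the CTEs from the test query above; the plan it prints depends on the Hive version and execution engine, so it may not match the output shown here exactly.

```sql
-- Second query form: each join carries its own ON clause.
-- Compare the printed plan with the plan for the single-ON form.
EXPLAIN
WITH A AS (
  SELECT stack(3,1,2,3) AS id
),
B AS (
  SELECT stack(3,1,2,3) AS id
),
C AS (
  SELECT stack(3,1,2,3) AS id
)
SELECT *
FROM A
INNER JOIN B ON A.id = B.id
INNER JOIN C ON A.id = C.id;
```

If the optimizer treats both forms alike, the join operator in this plan should list the same two conditions as the plan for the first form.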

At first I thought the first query would do a CROSS JOIN with table B and that the join with C would then reduce the dataset, but both queries work the same (same plan, same execution), thanks to the optimizer.

I also tested the same with map-join switched off (set hive.auto.convert.join=false;) and got exactly the same plan for both queries. I did not test this on really big tables, so you had better double-check.
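The check with map-join disabled could look roughly like the following. It is a sketch under the same assumptions as the test query above (three tiny CTEs standing in for real tables); for large tables, substitute your own tables and compare the two plans yourself.

```sql
-- Turn off automatic map-join conversion for this session
-- (the setting named in the answer).
SET hive.auto.convert.join=false;

-- First query form: one combined ON clause after both joins.
EXPLAIN
WITH A AS (
  SELECT stack(3,1,2,3) AS id
),
B AS (
  SELECT stack(3,1,2,3) AS id
),
C AS (
  SELECT stack(3,1,2,3) AS id
)
SELECT *
FROM A
INNER JOIN B
INNER JOIN C
ON A.id = B.id AND A.id = C.id;
```

Running the same EXPLAIN on the second query form in the same session gives the plan to compare against.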

So, computationally, both queries are the same on Hive 1.2 and Hive 2.3, for both map-join and merge join on the reducer.

Answer 1: accepted; 2020-12-10 17:38:49
