
Predicate Pushdown vs On Clause

When you perform a join in Hive and then filter the output with a WHERE clause, the Hive compiler tries to filter the data before the tables are joined. This is known as predicate pushdown ( http://allabouthadoop.net/what-is-predicate-pushdown-in-hive/ ).

For example:

SELECT * FROM a JOIN b ON a.some_id=b.some_other_id WHERE a.some_name=6

Table a is filtered down to the rows with some_name = 6 before the join is performed, if predicate pushdown is enabled (hive.optimize.ppd).

However, I have also learned recently that there is another way of filtering data from a table before joining it with another table ( https://vinaynotes.wordpress.com/2015/10/01/hive-tips-joins-occur-before-where-clause/ ).

One can put the condition in the ON clause, and table a will be filtered before the join is performed.

For example:

SELECT * FROM a JOIN b  ON a.some_id=b.some_other_id AND a.some_name=6

Do both of these provide the predicate pushdown optimization?

Thank you

Both are valid, and with an INNER JOIN and PPD both work the same. But these methods work differently with OUTER JOINs.

The ON join condition is applied before the join.

WHERE is applied after the join.
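
The INNER JOIN equivalence can be checked on any SQL engine. Below is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption: it verifies only that both queries return the same rows, not how Hive plans them). The sample rows are made up for illustration.

```python
import sqlite3

# SQLite stands in for Hive here; only the result sets are compared.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (some_id INTEGER, some_name INTEGER);
    CREATE TABLE b (some_other_id INTEGER);
    INSERT INTO a VALUES (1, 6), (2, 7), (3, 6);
    INSERT INTO b VALUES (1), (2), (4);
""")

# Filter on table a placed in the WHERE clause.
where_rows = conn.execute("""
    SELECT a.some_id, a.some_name, b.some_other_id
    FROM a JOIN b ON a.some_id = b.some_other_id
    WHERE a.some_name = 6
    ORDER BY a.some_id
""").fetchall()

# Same filter placed in the ON clause.
on_rows = conn.execute("""
    SELECT a.some_id, a.some_name, b.some_other_id
    FROM a JOIN b ON a.some_id = b.some_other_id AND a.some_name = 6
    ORDER BY a.some_id
""").fetchall()

print(where_rows)  # [(1, 6, 1)]
print(on_rows)     # [(1, 6, 1)]
```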

The optimizer decides whether predicate pushdown is applicable, and it may work. But in the case of a LEFT JOIN with a WHERE filter on the right table, for example, the WHERE filter

SELECT * FROM a 
             LEFT JOIN b ON a.some_id=b.some_other_id 
 WHERE b.some_name=6 --Right table filter

will exclude NULLs, and the LEFT JOIN will be transformed into an INNER JOIN, because if b.some_name=6, it cannot be NULL.

And PPD does not change this behavior.
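
This collapse to an INNER JOIN is easy to reproduce. Here is a small sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in for Hive and using made-up sample rows; it compares the LEFT JOIN plus WHERE query with a plain INNER JOIN.

```python
import sqlite3

# SQLite stands in for Hive; the join semantics shown are standard SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (some_id INTEGER);
    CREATE TABLE b (some_other_id INTEGER, some_name INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1, 6), (2, 7);
""")

# LEFT JOIN with a WHERE filter on the right table.
left_where = conn.execute("""
    SELECT a.some_id, b.some_name
    FROM a LEFT JOIN b ON a.some_id = b.some_other_id
    WHERE b.some_name = 6
    ORDER BY a.some_id
""").fetchall()

# Plain INNER JOIN with the same filter.
inner = conn.execute("""
    SELECT a.some_id, b.some_name
    FROM a JOIN b ON a.some_id = b.some_other_id
    WHERE b.some_name = 6
    ORDER BY a.some_id
""").fetchall()

print(left_where)  # [(1, 6)]
print(inner)       # [(1, 6)]
```

Row 3 of a has no match in b, so the LEFT JOIN produces it with NULLs, and the WHERE filter then discards it because NULL = 6 is not true.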

You can still do a LEFT JOIN with a WHERE filter if you add an OR condition that allows NULLs in the right table:

SELECT * FROM a 
             LEFT JOIN b ON a.some_id=b.some_other_id 
 WHERE b.some_name=6 OR b.some_other_id IS NULL --allow not joined records

And if you have multiple joins with many such filtering conditions, logic like this makes your query difficult to understand and error-prone.

A LEFT JOIN with an ON filter does not require the additional OR condition because it filters the right table before the join. This query works as expected and is easy to understand:

SELECT * FROM a 
             LEFT JOIN b ON a.some_id=b.some_other_id and b.some_name=6
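
To see what the ON filter returns compared with the WHERE ... OR ... IS NULL form, here is a short sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in for Hive and using made-up sample rows. On this data the two forms are not identical: a row of a whose only match in b fails the filter is dropped by the WHERE form but kept, with NULLs, by the ON form.

```python
import sqlite3

# SQLite stands in for Hive; the join semantics shown are standard SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (some_id INTEGER);
    CREATE TABLE b (some_other_id INTEGER, some_name INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1, 6), (2, 7);
""")

# WHERE filter with an extra OR condition that lets unjoined rows through.
where_or_null = conn.execute("""
    SELECT a.some_id, b.some_name
    FROM a LEFT JOIN b ON a.some_id = b.some_other_id
    WHERE b.some_name = 6 OR b.some_other_id IS NULL
    ORDER BY a.some_id
""").fetchall()

# Filter moved into the ON clause: every row of a is preserved.
on_filter = conn.execute("""
    SELECT a.some_id, b.some_name
    FROM a LEFT JOIN b ON a.some_id = b.some_other_id AND b.some_name = 6
    ORDER BY a.some_id
""").fetchall()

print(where_or_null)  # [(1, 6), (3, None)]
print(on_filter)      # [(1, 6), (2, None), (3, None)]
```

Row 2 of a joins to a row of b with some_name = 7, so after the join it is neither NULL nor 6 and the WHERE form removes it. With the ON filter that b row never qualifies for the join, so row 2 comes back with NULLs like any other unmatched row.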

PPD still works for the ON filter, and if table b is ORC, PPD will push the predicate down to the lowest possible level, the ORC reader, which will use the built-in ORC indexes for filtering on three levels: rows, stripes and files.

More on the same topic and some tests: https://stackoverflow.com/a/46843832/2700344

So, PPD or not, it is better to use explicit ANSI syntax with the ON condition and ON filtering where possible, to keep the query as simple as possible and to avoid converting to an INNER JOIN unintentionally.
