[英]Hive how to improve my query performance?
Is there any ways to improve the HQL performance?有什么方法可以提高 HQL 性能? I have a query like this:
我有一个这样的查询:
with
tmp_a as (
SELECT * FROM `zhihu.answer` where ym in (select distinct(ym) from zhihu.answer_increment)
),
-- the result of subquery select distinct(ym) from zhihu.answer_increment is 201806
-- the rows of tmp_a are 1,790,000
tmp1 as (
select a.* from tmp_a a
LEFT JOIN `zhihu.answer_increment` b
ON a.answer_id = b.answer_id
AND a.insert_time = b.insert_time
WHERE b.answer_id IS NULL)
insert overwrite table zhihu.answer partition(ym)
select * from tmp1
UNION ALL
SELECT *
FROM `zhihu.answer_increment` t
Table information:表信息:
Rows of `zhihu.answer` are about 10 millions. and was partitioned by `ym`, that means year and month,
and was clustered by `answer_id` into 256 buckets.
Rows of `zhihu.answer_increment` are about 100 thousands.
the data structure of these two tables are the same.
The purpose of above query is to merge the fresh created data ( answer_increment table
) with the history data ( answer table
), and save into Hive.上述查询的目的是将新创建的数据(
answer_increment table
)与历史数据( answer table
)合并,并保存到 Hive 中。
The question is the whole process took approximately 2 hours to run, is there any optimization for above HQL?问题是整个过程运行了大约2个小时,上面的HQL有什么优化吗?
I tried create index to the table, but not sure if this helped:我尝试为表创建索引,但不确定这是否有帮助:
CREATE INDEX insert_time_index ON TABLE zhihu.answer (insert_time) AS 'COMPACT'
WITH DEFERRED REBUILD;
My Hive setting:我的 Hive 设置:
set hive.auto.convert.join=true;
My HQL Explain:我的 HQL 解释:
Explain
STAGE DEPENDENCIES:
Stage-5 is a root stage
Stage-11 depends on stages: Stage-5 , consists of Stage-13, Stage-1
Stage-13 has a backup stage: Stage-1
Stage-10 depends on stages: Stage-13
Stage-9 depends on stages: Stage-1, Stage-10 , consists of Stage-12, Stage-2
Stage-12 has a backup stage: Stage-2
Stage-8 depends on stages: Stage-12
Stage-3 depends on stages: Stage-2, Stage-8
Stage-0 depends on stages: Stage-3
Stage-4 depends on stages: Stage-0
Stage-2
Stage-1
STAGE PLANS:
Stage: Stage-5
Map Reduce
Map Operator Tree:
TableScan
alias: zhihu_answer_increment
filterExpr: ym is not null (type: boolean)
Statistics: Num rows: 101549 Data size: 21426839 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: ym (type: string)
outputColumnNames: ym
Statistics: Num rows: 101549 Data size: 21426839 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
keys: ym (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-11
Conditional Operator
Stage: Stage-13
Map Reduce Local Work
Alias -> Map Local Tables:
_u1-subquery1:tmp1:a:$INTNAME
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
_u1-subquery1:tmp1:a:$INTNAME
TableScan
HashTable Sink Operator
keys:
0 ym (type: string)
1 _col0 (type: string)
Stage: Stage-10
Map Reduce
Map Operator Tree:
TableScan
alias: zhihu.zhihu_answer
filterExpr: ym is not null (type: boolean)
Statistics: Num rows: 76466394 Data size: 16134409134 Basic stats: COMPLETE Column stats: PARTIAL
Map Join Operator
condition map:
Left Semi Join 0 to 1
keys:
0 ym (type: string)
1 _col0 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27
Statistics: Num rows: 4024547 Data size: 7356871916 Basic stats: COMPLETE Column stats: PARTIAL
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Local Work:
Map Reduce Local Work
Stage: Stage-9
Conditional Operator
Stage: Stage-12
Map Reduce Local Work
Alias -> Map Local Tables:
_u1-subquery1:tmp1:b
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
_u1-subquery1:tmp1:b
TableScan
alias: b
Statistics: Num rows: 101549 Data size: 2741823 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 _col3 (type: string), _col15 (type: string)
1 answer_id (type: string), insert_time (type: string)
Stage: Stage-8
Map Reduce
Map Operator Tree:
TableScan
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 _col3 (type: string), _col15 (type: string)
1 answer_id (type: string), insert_time (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col31
Statistics: Num rows: 4427001 Data size: 8092559283 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: _col31 is null (type: boolean)
Statistics: Num rows: 2213500 Data size: 4046278727 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: boolean), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: smallint), _col12 (type: boolean), _col13 (type: int), _col14 (type: string), _col15 (type: string), _col16 (type: boolean), _col17 (type: boolean), _col18 (type: boolean), _col19 (type: boolean), _col20 (type: string), _col21 (type: string), _col22 (type: string), _col23 (type: string), _col24 (type: int), _col25 (type: int), _col26 (type: int), _col27 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27
Statistics: Num rows: 2213500 Data size: 4046278727 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Local Work:
Map Reduce Local Work
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
Union
Statistics: Num rows: 2315049 Data size: 4067705566 Basic stats: COMPLETE Column stats: PARTIAL
Reduce Output Operator
sort order:
Map-reduce partition columns: _col3 (type: string)
Statistics: Num rows: 2315049 Data size: 4067705566 Basic stats: COMPLETE Column stats: PARTIAL
value expressions: _col0 (type: boolean), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: smallint), _col12 (type: boolean), _col13 (type: int), _col14 (type: string), _col15 (type: string), _col16 (type: boolean), _col17 (type: boolean), _col18 (type: boolean), _col19 (type: boolean), _col20 (type: string), _col21 (type: string), _col22 (type: string), _col23 (type: string), _col24 (type: int), _col25 (type: int), _col26 (type: int), _col27 (type: string)
TableScan
alias: t
Statistics: Num rows: 101549 Data size: 21426839 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: admin_closed_comment (type: boolean), answer_content (type: string), answer_created (type: string), answer_id (type: string), answer_updated (type: string), author_headline (type: string), author_id (type: string), author_name (type: string), author_type (type: string), author_url_token (type: string), avatar_url (type: string), badge_num (type: smallint), can_comment (type: boolean), comment_count (type: int), gender (type: string), insert_time (type: string), is_advertiser (type: boolean), is_collapsed (type: boolean), is_copyable (type: boolean), is_org (type: boolean), question_created (type: string), question_id (type: string), question_title (type: string), question_type (type: string), reward_member_count (type: int), reward_total_money (type: int), voteup_count (type: int), ym (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27
Statistics: Num rows: 101549 Data size: 21426839 Basic stats: COMPLETE Column stats: PARTIAL
Union
Statistics: Num rows: 2315049 Data size: 4067705566 Basic stats: COMPLETE Column stats: PARTIAL
Reduce Output Operator
sort order:
Map-reduce partition columns: _col3 (type: string)
Statistics: Num rows: 2315049 Data size: 4067705566 Basic stats: COMPLETE Column stats: PARTIAL
value expressions: _col0 (type: boolean), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: smallint), _col12 (type: boolean), _col13 (type: int), _col14 (type: string), _col15 (type: string), _col16 (type: boolean), _col17 (type: boolean), _col18 (type: boolean), _col19 (type: boolean), _col20 (type: string), _col21 (type: string), _col22 (type: string), _col23 (type: string), _col24 (type: int), _col25 (type: int), _col26 (type: int), _col27 (type: string)
Reduce Operator Tree:
Select Operator
expressions: VALUE._col0 (type: boolean), VALUE._col1 (type: string), VALUE._col2 (type: string), VALUE._col3 (type: string), VALUE._col4 (type: string), VALUE._col5 (type: string), VALUE._col6 (type: string), VALUE._col7 (type: string), VALUE._col8 (type: string), VALUE._col9 (type: string), VALUE._col10 (type: string), VALUE._col11 (type: smallint), VALUE._col12 (type: boolean), VALUE._col13 (type: int), VALUE._col14 (type: string), VALUE._col15 (type: string), VALUE._col16 (type: boolean), VALUE._col17 (type: boolean), VALUE._col18 (type: boolean), VALUE._col19 (type: boolean), VALUE._col20 (type: string), VALUE._col21 (type: string), VALUE._col22 (type: string), VALUE._col23 (type: string), VALUE._col24 (type: int), VALUE._col25 (type: int), VALUE._col26 (type: int), VALUE._col27 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27
Statistics: Num rows: 2315049 Data size: 425969016 Basic stats: COMPLETE Column stats: PARTIAL
File Output Operator
compressed: false
Statistics: Num rows: 2315049 Data size: 425969016 Basic stats: COMPLETE Column stats: PARTIAL
table:
input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
name: zhihu.zhihu_answer
Stage: Stage-0
Move Operator
tables:
partition:
ym
replace: true
table:
input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
name: zhihu.zhihu_answer
Stage: Stage-4
Stats-Aggr Operator
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col3 (type: string), _col15 (type: string)
sort order: ++
Map-reduce partition columns: _col3 (type: string), _col15 (type: string)
Statistics: Num rows: 4024547 Data size: 7356871916 Basic stats: COMPLETE Column stats: PARTIAL
value expressions: _col0 (type: boolean), _col1 (type: string), _col2 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: smallint), _col12 (type: boolean), _col13 (type: int), _col14 (type: string), _col16 (type: boolean), _col17 (type: boolean), _col18 (type: boolean), _col19 (type: boolean), _col20 (type: string), _col21 (type: string), _col22 (type: string), _col23 (type: string), _col24 (type: int), _col25 (type: int), _col26 (type: int), _col27 (type: string)
TableScan
alias: b
Statistics: Num rows: 101549 Data size: 2741823 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: answer_id (type: string), insert_time (type: string)
sort order: ++
Map-reduce partition columns: answer_id (type: string), insert_time (type: string)
Statistics: Num rows: 101549 Data size: 2741823 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 _col3 (type: string), _col15 (type: string)
1 answer_id (type: string), insert_time (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col31
Statistics: Num rows: 4427001 Data size: 8092559283 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: _col31 is null (type: boolean)
Statistics: Num rows: 2213500 Data size: 4046278727 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: boolean), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: smallint), _col12 (type: boolean), _col13 (type: int), _col14 (type: string), _col15 (type: string), _col16 (type: boolean), _col17 (type: boolean), _col18 (type: boolean), _col19 (type: boolean), _col20 (type: string), _col21 (type: string), _col22 (type: string), _col23 (type: string), _col24 (type: int), _col25 (type: int), _col26 (type: int), _col27 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27
Statistics: Num rows: 2213500 Data size: 4046278727 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: zhihu.zhihu_answer
filterExpr: ym is not null (type: boolean)
Statistics: Num rows: 76466394 Data size: 16134409134 Basic stats: COMPLETE Column stats: PARTIAL
Reduce Output Operator
key expressions: ym (type: string)
sort order: +
Map-reduce partition columns: ym (type: string)
Statistics: Num rows: 76466394 Data size: 16134409134 Basic stats: COMPLETE Column stats: PARTIAL
value expressions: admin_closed_comment (type: boolean), answer_content (type: string), answer_created (type: string), answer_id (type: string), answer_updated (type: string), author_headline (type: string), author_id (type: string), author_name (type: string), author_type (type: string), author_url_token (type: string), avatar_url (type: string), badge_num (type: smallint), can_comment (type: boolean), comment_count (type: int), gender (type: string), insert_time (type: string), is_advertiser (type: boolean), is_collapsed (type: boolean), is_copyable (type: boolean), is_org (type: boolean), question_created (type: string), question_id (type: string), question_title (type: string), question_type (type: string), reward_member_count (type: int), reward_total_money (type: int), voteup_count (type: int)
TableScan
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Operator Tree:
Join Operator
condition map:
Left Semi Join 0 to 1
keys:
0 ym (type: string)
1 _col0 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27
Statistics: Num rows: 4024547 Data size: 7356871916 Basic stats: COMPLETE Column stats: PARTIAL
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
@leftjoin's HQL Explain: @leftjoin 的 HQL 说明:
Explain
STAGE DEPENDENCIES:
Stage-3 is a root stage
Stage-6 depends on stages: Stage-3 , consists of Stage-7, Stage-1
Stage-7 has a backup stage: Stage-1
Stage-5 depends on stages: Stage-7
Stage-2 depends on stages: Stage-1, Stage-5
Stage-1
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: zhihu_answer_increment
filterExpr: ym is not null (type: boolean)
Statistics: Num rows: 101549 Data size: 21426839 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: ym (type: string)
outputColumnNames: ym
Statistics: Num rows: 101549 Data size: 21426839 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
keys: ym (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-6
Conditional Operator
Stage: Stage-7
Map Reduce Local Work
Alias -> Map Local Tables:
s:s-subquery1:$INTNAME
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
s:s-subquery1:$INTNAME
TableScan
HashTable Sink Operator
keys:
0 ym (type: string)
1 _col0 (type: string)
Stage: Stage-5
Map Reduce
Map Operator Tree:
TableScan
alias: t
filterExpr: ym is not null (type: boolean)
Statistics: Num rows: 76466394 Data size: 16134409134 Basic stats: COMPLETE Column stats: PARTIAL
Map Join Operator
condition map:
Left Semi Join 0 to 1
keys:
0 ym (type: string)
1 _col0 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27
Statistics: Num rows: 4024547 Data size: 7340773728 Basic stats: COMPLETE Column stats: PARTIAL
Select Operator
expressions: _col0 (type: boolean), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: smallint), _col12 (type: boolean), _col14 (type: string), _col15 (type: string), _col16 (type: boolean), _col17 (type: boolean), _col18 (type: boolean), _col19 (type: boolean), _col20 (type: string), _col21 (type: string), _col22 (type: string), _col23 (type: string), _col24 (type: int), _col25 (type: int), _col26 (type: int), _col27 (type: string), 0 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27
Statistics: Num rows: 4024547 Data size: 756614836 Basic stats: COMPLETE Column stats: PARTIAL
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Local Work:
Map Reduce Local Work
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Union
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000;
set hive.execution.engine=tez;
insert overwrite table zhihu.answer partition(ym)
select col1, col2 ... coln, ym, --list all columns
from
(
select col1, col2 ... coln, ym, --list all columns
row_number() over(partition by ym, answer_id, insert_time order by new_flag desc) rn
from
(
select col1, col2 ... coln, ym, --list all columns
0 as new_flag
from zhihu.answer t
where t.ym in (select distinct ym from zhihu.answer_increment)
UNION ALL
select col1, col2 ... coln, ym, --list all columns
1 as new_flag
from zhihu.answer_increment t
)s
)s
where s.rn=1;
Indexes were removed in Hive 3.0 , more details in this Jira: HIVE-18448 在 Hive 3.0 中删除了索引,此 Jira 中的更多详细信息: HIVE-18448
See also this answer: https://stackoverflow.com/a/37744071/2700344另请参阅此答案: https://stackoverflow.com/a/37744071/2700344
Also tune parallelism for better performance: https://stackoverflow.com/a/48487306/2700344还可以调整并行度以获得更好的性能: https://stackoverflow.com/a/48487306/2700344
UPDATE: I studied plans provided by @DennisLi.更新:我研究了@DennisLi 提供的计划。 Some observations:
一些观察:
Join of big table with the whole increment one is performed as a map-join.与整个增量一的大表连接作为映射连接执行。 In this case FULL join approach can be better than UNION ALL+row_number.
在这种情况下,FULL join 方法可能比 UNION ALL+row_number 更好。
join with partition list is already transformed by optimizer to LEFT SEMI JOIN (works as map join also), after filtering there are 4K rows out of total 70M. join with partition list 已经被优化器转换为 LEFT SEMI JOIN(也可以用作 map join),过滤后总共 70M 中有 4K 行。 I recommend to calculate min and max increment ym partition separately and pass them as a parameters using WHERE ym>= ${min_increment_ym} and ym<=${max_increment_ym} In this case partition pruning will filter data efficiently without join.
我建议分别计算最小和最大增量 ym 分区,并使用 WHERE ym>= ${min_increment_ym} 和 ym<=${max_increment_ym} 将它们作为参数传递。在这种情况下,分区修剪将有效地过滤数据而无需连接。 But it can be applied only if applicable to the increment dataset (increment contains single small range of partitions and we can use min and max efficiently) Implementing this will give you the maximum benefit
但它只能在适用于增量数据集的情况下应用(增量包含单个小范围的分区,我们可以有效地使用 min 和 max)实现这一点将为您带来最大的好处
Intermediate compression is not enabled.未启用中间压缩。 Enabling compression may give you a little, but it worth trying
启用压缩可能会给你一点,但值得一试
Recommended approach:推荐方法:
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000;
--check compression influence separately.
--it may give some improvement depending on your data entropy
set hive.exec.compress.intermediate=true;
set mapred.output.compress=true;
set hive.exec.compress.output=true;
insert overwrite table zhihu.answer partition(ym)
select --select increment if exists, old if not exists
case when i.answer_id is not null then i.col1 else t.col1 end as col1,
... --for all columns
case when i.answer_id is not null then i.coln else t.coln end as coln,
--partition is the last one
case when i.answer_id is not null then i.ym else t.ym end as ym
from zhihu.answer t
full join zhihu.answer_increment i
on t.answer_id = i.answer_id
and t.insert_time = i.insert_time
and t.ym=i.ym --check this condition
where t.ym in (select distinct ym from zhihu.answer_increment) --try to implement min and max parameters instead of this if possible (see 2)
--alternatively if you do not want to employ shell, check if you can
--remove the WHERE condition providing ym in the join condition,
--this will allow to get rid of the second join in the plan,
--though partition pruning will work with parameters better
;
Finally the plan will be the most optimal one.最后,该计划将是最优化的。
And you still may need to tune parallelism on mappers and reducers based on your execution logs, see previous answer recommendation而且您可能仍然需要根据执行日志调整映射器和减速器的并行度,请参阅先前的答案推荐
There are few things you can do你可以做的事情很少
Try to use left semi join in your first query.尝试在您的第一个查询中使用左半联接。
As your table zhihu.answer_increment is very small in comparison to you can try giving map side join hints由于您的表 zhihu.answer_increment 与您相比非常小,您可以尝试提供 map 侧连接提示
If your tables is bucketed on same columns and both bucktes are same try with SMB approach如果您的表存储在相同的列上并且两个存储桶相同,请尝试使用 SMB 方法
Try setting below properties尝试设置以下属性
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.