
Hive: how to improve my query performance?

Is there any way to improve the performance of this HQL? I have a query like this:

with
tmp_a as (
SELECT * FROM `zhihu.answer` where ym in (select distinct(ym) from zhihu.answer_increment)
),
-- the result of subquery select distinct(ym) from zhihu.answer_increment is 201806
-- the rows of tmp_a are 1,790,000
tmp1 as (
select a.* from tmp_a a
LEFT JOIN `zhihu.answer_increment` b
ON a.answer_id = b.answer_id
AND a.insert_time = b.insert_time
WHERE b.answer_id IS NULL)

insert overwrite table zhihu.answer partition(ym)
select * from tmp1
UNION ALL
SELECT *
FROM `zhihu.answer_increment` t 

Table information:

`zhihu.answer` has about 10 million rows. It is partitioned by `ym` (year and month)
and clustered by `answer_id` into 256 buckets.
`zhihu.answer_increment` has about 100 thousand rows.

The data structures of these two tables are the same.
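
For reference, a minimal sketch of what the DDL is assumed to look like, reconstructed from the description above and the EXPLAIN output further down (Parquet storage, ym partition, 256 buckets on answer_id); only the columns relevant to the merge logic are listed:

CREATE TABLE zhihu.answer (
  answer_id      STRING,
  insert_time    STRING,
  answer_content STRING
  -- remaining answer/author/question columns omitted in this sketch
)
PARTITIONED BY (ym STRING)                 -- year and month, e.g. 201806
CLUSTERED BY (answer_id) INTO 256 BUCKETS
STORED AS PARQUET;

-- zhihu.answer_increment is assumed to have the same schema.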

The purpose of the above query is to merge the newly created data (the answer_increment table) with the historical data (the answer table) and save the result into Hive.

The problem is that the whole process takes approximately 2 hours to run. Is there any way to optimize the above HQL?

I tried creating an index on the table, but I am not sure whether it helped:

CREATE INDEX insert_time_index ON TABLE zhihu.answer (insert_time) AS 'COMPACT'
 WITH DEFERRED REBUILD;

My Hive settings:

set hive.auto.convert.join=true;

My HQL EXPLAIN output:

Explain 
STAGE DEPENDENCIES: 
  Stage-5 is a root stage   
  Stage-11 depends on stages: Stage-5 , consists of Stage-13, Stage-1   
  Stage-13 has a backup stage: Stage-1  
  Stage-10 depends on stages: Stage-13  
  Stage-9 depends on stages: Stage-1, Stage-10 , consists of Stage-12, Stage-2  
  Stage-12 has a backup stage: Stage-2  
  Stage-8 depends on stages: Stage-12   
  Stage-3 depends on stages: Stage-2, Stage-8   
  Stage-0 depends on stages: Stage-3    
  Stage-4 depends on stages: Stage-0    
  Stage-2   
  Stage-1   

STAGE PLANS:    
  Stage: Stage-5    
    Map Reduce  
      Map Operator Tree:    
          TableScan 
            alias: zhihu_answer_increment   
            filterExpr: ym is not null (type: boolean)  
            Statistics: Num rows: 101549 Data size: 21426839 Basic stats: COMPLETE Column stats: COMPLETE   
            Select Operator 
              expressions: ym (type: string)    
              outputColumnNames: ym 
              Statistics: Num rows: 101549 Data size: 21426839 Basic stats: COMPLETE Column stats: COMPLETE 
              Group By Operator 
                keys: ym (type: string) 
                mode: hash  
                outputColumnNames: _col0    
                Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE 
                Reduce Output Operator  
                  key expressions: _col0 (type: string) 
                  sort order: + 
                  Map-reduce partition columns: _col0 (type: string)    
                  Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE   
      Execution mode: vectorized    
      Reduce Operator Tree: 
        Group By Operator   
          keys: KEY._col0 (type: string)    
          mode: mergepartial    
          outputColumnNames: _col0  
          Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE   
          Group By Operator 
            keys: _col0 (type: string)  
            mode: hash  
            outputColumnNames: _col0    
            Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE 
            File Output Operator    
              compressed: false 
              table:    
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat    
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat  
                  serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe   

  Stage: Stage-11   
    Conditional Operator    

  Stage: Stage-13   
    Map Reduce Local Work   
      Alias -> Map Local Tables:    
        _u1-subquery1:tmp1:a:$INTNAME   
          Fetch Operator    
            limit: -1   
      Alias -> Map Local Operator Tree: 
        _u1-subquery1:tmp1:a:$INTNAME   
          TableScan 
            HashTable Sink Operator 
              keys: 
                0 ym (type: string) 
                1 _col0 (type: string)  

  Stage: Stage-10   
    Map Reduce  
      Map Operator Tree:    
          TableScan 
            alias: zhihu.zhihu_answer   
            filterExpr: ym is not null (type: boolean)  
            Statistics: Num rows: 76466394 Data size: 16134409134 Basic stats: COMPLETE Column stats: PARTIAL   
            Map Join Operator   
              condition map:    
                   Left Semi Join 0 to 1    
              keys: 
                0 ym (type: string) 
                1 _col0 (type: string)  
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27   
              Statistics: Num rows: 4024547 Data size: 7356871916 Basic stats: COMPLETE Column stats: PARTIAL   
              File Output Operator  
                compressed: false   
                table:  
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat  
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat    
                    serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe 
      Local Work:   
        Map Reduce Local Work   

  Stage: Stage-9    
    Conditional Operator    

  Stage: Stage-12   
    Map Reduce Local Work   
      Alias -> Map Local Tables:    
        _u1-subquery1:tmp1:b    
          Fetch Operator    
            limit: -1   
      Alias -> Map Local Operator Tree: 
        _u1-subquery1:tmp1:b    
          TableScan 
            alias: b    
            Statistics: Num rows: 101549 Data size: 2741823 Basic stats: COMPLETE Column stats: NONE    
            HashTable Sink Operator 
              keys: 
                0 _col3 (type: string), _col15 (type: string)   
                1 answer_id (type: string), insert_time (type: string)  

  Stage: Stage-8    
    Map Reduce  
      Map Operator Tree:    
          TableScan 
            Map Join Operator   
              condition map:    
                   Left Outer Join0 to 1    
              keys: 
                0 _col3 (type: string), _col15 (type: string)   
                1 answer_id (type: string), insert_time (type: string)  
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col31   
              Statistics: Num rows: 4427001 Data size: 8092559283 Basic stats: COMPLETE Column stats: NONE  
              Filter Operator   
                predicate: _col31 is null (type: boolean)   
                Statistics: Num rows: 2213500 Data size: 4046278727 Basic stats: COMPLETE Column stats: NONE    
                Select Operator 
                  expressions: _col0 (type: boolean), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: smallint), _col12 (type: boolean), _col13 (type: int), _col14 (type: string), _col15 (type: string), _col16 (type: boolean), _col17 (type: boolean), _col18 (type: boolean), _col19 (type: boolean), _col20 (type: string), _col21 (type: string), _col22 (type: string), _col23 (type: string), _col24 (type: int), _col25 (type: int), _col26 (type: int), _col27 (type: string) 
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27   
                  Statistics: Num rows: 2213500 Data size: 4046278727 Basic stats: COMPLETE Column stats: NONE  
                  File Output Operator  
                    compressed: false   
                    table:  
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat  
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat    
                        serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe 
      Local Work:   
        Map Reduce Local Work   

  Stage: Stage-3    
    Map Reduce  
      Map Operator Tree:    
          TableScan 
            Union   
              Statistics: Num rows: 2315049 Data size: 4067705566 Basic stats: COMPLETE Column stats: PARTIAL   
              Reduce Output Operator    
                sort order:     
                Map-reduce partition columns: _col3 (type: string)  
                Statistics: Num rows: 2315049 Data size: 4067705566 Basic stats: COMPLETE Column stats: PARTIAL 
                value expressions: _col0 (type: boolean), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: smallint), _col12 (type: boolean), _col13 (type: int), _col14 (type: string), _col15 (type: string), _col16 (type: boolean), _col17 (type: boolean), _col18 (type: boolean), _col19 (type: boolean), _col20 (type: string), _col21 (type: string), _col22 (type: string), _col23 (type: string), _col24 (type: int), _col25 (type: int), _col26 (type: int), _col27 (type: string) 
          TableScan 
            alias: t    
            Statistics: Num rows: 101549 Data size: 21426839 Basic stats: COMPLETE Column stats: PARTIAL    
            Select Operator 
              expressions: admin_closed_comment (type: boolean), answer_content (type: string), answer_created (type: string), answer_id (type: string), answer_updated (type: string), author_headline (type: string), author_id (type: string), author_name (type: string), author_type (type: string), author_url_token (type: string), avatar_url (type: string), badge_num (type: smallint), can_comment (type: boolean), comment_count (type: int), gender (type: string), insert_time (type: string), is_advertiser (type: boolean), is_collapsed (type: boolean), is_copyable (type: boolean), is_org (type: boolean), question_created (type: string), question_id (type: string), question_title (type: string), question_type (type: string), reward_member_count (type: int), reward_total_money (type: int), voteup_count (type: int), ym (type: string)   
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27   
              Statistics: Num rows: 101549 Data size: 21426839 Basic stats: COMPLETE Column stats: PARTIAL  
              Union 
                Statistics: Num rows: 2315049 Data size: 4067705566 Basic stats: COMPLETE Column stats: PARTIAL 
                Reduce Output Operator  
                  sort order:   
                  Map-reduce partition columns: _col3 (type: string)    
                  Statistics: Num rows: 2315049 Data size: 4067705566 Basic stats: COMPLETE Column stats: PARTIAL   
                  value expressions: _col0 (type: boolean), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: smallint), _col12 (type: boolean), _col13 (type: int), _col14 (type: string), _col15 (type: string), _col16 (type: boolean), _col17 (type: boolean), _col18 (type: boolean), _col19 (type: boolean), _col20 (type: string), _col21 (type: string), _col22 (type: string), _col23 (type: string), _col24 (type: int), _col25 (type: int), _col26 (type: int), _col27 (type: string)   
      Reduce Operator Tree: 
        Select Operator 
          expressions: VALUE._col0 (type: boolean), VALUE._col1 (type: string), VALUE._col2 (type: string), VALUE._col3 (type: string), VALUE._col4 (type: string), VALUE._col5 (type: string), VALUE._col6 (type: string), VALUE._col7 (type: string), VALUE._col8 (type: string), VALUE._col9 (type: string), VALUE._col10 (type: string), VALUE._col11 (type: smallint), VALUE._col12 (type: boolean), VALUE._col13 (type: int), VALUE._col14 (type: string), VALUE._col15 (type: string), VALUE._col16 (type: boolean), VALUE._col17 (type: boolean), VALUE._col18 (type: boolean), VALUE._col19 (type: boolean), VALUE._col20 (type: string), VALUE._col21 (type: string), VALUE._col22 (type: string), VALUE._col23 (type: string), VALUE._col24 (type: int), VALUE._col25 (type: int), VALUE._col26 (type: int), VALUE._col27 (type: string) 
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27   
          Statistics: Num rows: 2315049 Data size: 425969016 Basic stats: COMPLETE Column stats: PARTIAL    
          File Output Operator  
            compressed: false   
            Statistics: Num rows: 2315049 Data size: 425969016 Basic stats: COMPLETE Column stats: PARTIAL  
            table:  
                input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat 
                output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat   
                serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  
                name: zhihu.zhihu_answer    

  Stage: Stage-0    
    Move Operator   
      tables:   
          partition:    
            ym  
          replace: true 
          table:    
              input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat   
              output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat 
              serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe    
              name: zhihu.zhihu_answer  

  Stage: Stage-4    
    Stats-Aggr Operator 

  Stage: Stage-2    
    Map Reduce  
      Map Operator Tree:    
          TableScan 
            Reduce Output Operator  
              key expressions: _col3 (type: string), _col15 (type: string)  
              sort order: ++    
              Map-reduce partition columns: _col3 (type: string), _col15 (type: string) 
              Statistics: Num rows: 4024547 Data size: 7356871916 Basic stats: COMPLETE Column stats: PARTIAL   
              value expressions: _col0 (type: boolean), _col1 (type: string), _col2 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: smallint), _col12 (type: boolean), _col13 (type: int), _col14 (type: string), _col16 (type: boolean), _col17 (type: boolean), _col18 (type: boolean), _col19 (type: boolean), _col20 (type: string), _col21 (type: string), _col22 (type: string), _col23 (type: string), _col24 (type: int), _col25 (type: int), _col26 (type: int), _col27 (type: string)    
          TableScan 
            alias: b    
            Statistics: Num rows: 101549 Data size: 2741823 Basic stats: COMPLETE Column stats: NONE    
            Reduce Output Operator  
              key expressions: answer_id (type: string), insert_time (type: string) 
              sort order: ++    
              Map-reduce partition columns: answer_id (type: string), insert_time (type: string)    
              Statistics: Num rows: 101549 Data size: 2741823 Basic stats: COMPLETE Column stats: NONE  
      Reduce Operator Tree: 
        Join Operator   
          condition map:    
               Left Outer Join0 to 1    
          keys: 
            0 _col3 (type: string), _col15 (type: string)   
            1 answer_id (type: string), insert_time (type: string)  
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col31   
          Statistics: Num rows: 4427001 Data size: 8092559283 Basic stats: COMPLETE Column stats: NONE  
          Filter Operator   
            predicate: _col31 is null (type: boolean)   
            Statistics: Num rows: 2213500 Data size: 4046278727 Basic stats: COMPLETE Column stats: NONE    
            Select Operator 
              expressions: _col0 (type: boolean), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: smallint), _col12 (type: boolean), _col13 (type: int), _col14 (type: string), _col15 (type: string), _col16 (type: boolean), _col17 (type: boolean), _col18 (type: boolean), _col19 (type: boolean), _col20 (type: string), _col21 (type: string), _col22 (type: string), _col23 (type: string), _col24 (type: int), _col25 (type: int), _col26 (type: int), _col27 (type: string) 
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27   
              Statistics: Num rows: 2213500 Data size: 4046278727 Basic stats: COMPLETE Column stats: NONE  
              File Output Operator  
                compressed: false   
                table:  
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat  
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat    
                    serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe 

  Stage: Stage-1    
    Map Reduce  
      Map Operator Tree:    
          TableScan 
            alias: zhihu.zhihu_answer   
            filterExpr: ym is not null (type: boolean)  
            Statistics: Num rows: 76466394 Data size: 16134409134 Basic stats: COMPLETE Column stats: PARTIAL   
            Reduce Output Operator  
              key expressions: ym (type: string)    
              sort order: + 
              Map-reduce partition columns: ym (type: string)   
              Statistics: Num rows: 76466394 Data size: 16134409134 Basic stats: COMPLETE Column stats: PARTIAL 
              value expressions: admin_closed_comment (type: boolean), answer_content (type: string), answer_created (type: string), answer_id (type: string), answer_updated (type: string), author_headline (type: string), author_id (type: string), author_name (type: string), author_type (type: string), author_url_token (type: string), avatar_url (type: string), badge_num (type: smallint), can_comment (type: boolean), comment_count (type: int), gender (type: string), insert_time (type: string), is_advertiser (type: boolean), is_collapsed (type: boolean), is_copyable (type: boolean), is_org (type: boolean), question_created (type: string), question_id (type: string), question_title (type: string), question_type (type: string), reward_member_count (type: int), reward_total_money (type: int), voteup_count (type: int)    
          TableScan 
            Reduce Output Operator  
              key expressions: _col0 (type: string) 
              sort order: + 
              Map-reduce partition columns: _col0 (type: string)    
              Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE   
      Reduce Operator Tree: 
        Join Operator   
          condition map:    
               Left Semi Join 0 to 1    
          keys: 
            0 ym (type: string) 
            1 _col0 (type: string)  
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27   
          Statistics: Num rows: 4024547 Data size: 7356871916 Basic stats: COMPLETE Column stats: PARTIAL   
          File Output Operator  
            compressed: false   
            table:  
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat  
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat    
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe 

@leftjoin's HQL EXPLAIN output:

Explain 
STAGE DEPENDENCIES: 
  Stage-3 is a root stage   
  Stage-6 depends on stages: Stage-3 , consists of Stage-7, Stage-1 
  Stage-7 has a backup stage: Stage-1   
  Stage-5 depends on stages: Stage-7    
  Stage-2 depends on stages: Stage-1, Stage-5   
  Stage-1   
  Stage-0 depends on stages: Stage-2    

STAGE PLANS:    
  Stage: Stage-3    
    Map Reduce  
      Map Operator Tree:    
          TableScan 
            alias: zhihu_answer_increment   
            filterExpr: ym is not null (type: boolean)  
            Statistics: Num rows: 101549 Data size: 21426839 Basic stats: COMPLETE Column stats: COMPLETE   
            Select Operator 
              expressions: ym (type: string)    
              outputColumnNames: ym 
              Statistics: Num rows: 101549 Data size: 21426839 Basic stats: COMPLETE Column stats: COMPLETE 
              Group By Operator 
                keys: ym (type: string) 
                mode: hash  
                outputColumnNames: _col0    
                Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE 
                Reduce Output Operator  
                  key expressions: _col0 (type: string) 
                  sort order: + 
                  Map-reduce partition columns: _col0 (type: string)    
                  Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE   
      Execution mode: vectorized    
      Reduce Operator Tree: 
        Group By Operator   
          keys: KEY._col0 (type: string)    
          mode: mergepartial    
          outputColumnNames: _col0  
          Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE   
          Group By Operator 
            keys: _col0 (type: string)  
            mode: hash  
            outputColumnNames: _col0    
            Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: COMPLETE 
            File Output Operator    
              compressed: false 
              table:    
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat    
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat  
                  serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe   

  Stage: Stage-6    
    Conditional Operator    

  Stage: Stage-7    
    Map Reduce Local Work   
      Alias -> Map Local Tables:    
        s:s-subquery1:$INTNAME  
          Fetch Operator    
            limit: -1   
      Alias -> Map Local Operator Tree: 
        s:s-subquery1:$INTNAME  
          TableScan 
            HashTable Sink Operator 
              keys: 
                0 ym (type: string) 
                1 _col0 (type: string)  

  Stage: Stage-5    
    Map Reduce  
      Map Operator Tree:    
          TableScan 
            alias: t    
            filterExpr: ym is not null (type: boolean)  
            Statistics: Num rows: 76466394 Data size: 16134409134 Basic stats: COMPLETE Column stats: PARTIAL   
            Map Join Operator   
              condition map:    
                   Left Semi Join 0 to 1    
              keys: 
                0 ym (type: string) 
                1 _col0 (type: string)  
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27   
              Statistics: Num rows: 4024547 Data size: 7340773728 Basic stats: COMPLETE Column stats: PARTIAL   
              Select Operator   
                expressions: _col0 (type: boolean), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: smallint), _col12 (type: boolean), _col14 (type: string), _col15 (type: string), _col16 (type: boolean), _col17 (type: boolean), _col18 (type: boolean), _col19 (type: boolean), _col20 (type: string), _col21 (type: string), _col22 (type: string), _col23 (type: string), _col24 (type: int), _col25 (type: int), _col26 (type: int), _col27 (type: string), 0 (type: int)    
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27 
                Statistics: Num rows: 4024547 Data size: 756614836 Basic stats: COMPLETE Column stats: PARTIAL  
                File Output Operator    
                  compressed: false 
                  table:    
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat    
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat  
                      serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe   
      Local Work:   
        Map Reduce Local Work   

  Stage: Stage-2    
    Map Reduce  
      Map Operator Tree:    
          TableScan 
            Union   

Answer from @leftjoin:

set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000;
set hive.execution.engine=tez;

insert overwrite table zhihu.answer partition(ym)
select col1, col2 ... coln, ym --list all columns
from
(
select col1, col2 ... coln, ym, --list all columns
       row_number() over(partition by ym, answer_id, insert_time order by new_flag desc) rn
from
 (
     select col1, col2 ... coln, ym, --list all columns
            0 as new_flag
       from zhihu.answer t
      where t.ym in (select distinct ym from zhihu.answer_increment)

   UNION ALL 

     select col1, col2 ... coln, ym, --list all columns
            1 as new_flag
       from zhihu.answer_increment t 
 )s
)s
where s.rn=1;

Indexes were removed in Hive 3.0; more details in this Jira: HIVE-18448.

See also this answer: https://stackoverflow.com/a/37744071/2700344

Also tune parallelism for better performance: https://stackoverflow.com/a/48487306/2700344

UPDATE: I studied the plans provided by @DennisLi. Some observations:

  1. The join of the big table with the whole increment table is performed as a map-join. In this case a FULL JOIN approach can be better than UNION ALL + row_number.

  2. The join with the partition list is already transformed by the optimizer into a LEFT SEMI JOIN (which also works as a map join); after filtering there are 4K rows out of a total of 70M. I recommend calculating the min and max increment ym partitions separately and passing them as parameters using WHERE ym >= ${min_increment_ym} and ym <= ${max_increment_ym}. In this case partition pruning will filter the data efficiently without a join. It can be applied only if it suits the increment dataset (the increment contains a single small range of partitions, so min and max can be used efficiently). Implementing this will give you the maximum benefit; a sketch of the idea follows this list.

  3. Intermediate compression is not enabled. Enabling compression may gain you only a little, but it is worth trying.
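
A hedged sketch of the parameter idea in observation 2. The variable names, the file name merge.hql and the value 201806 are illustrative only; the point is that passing the range as literal constants lets partition pruning work without the extra join:

-- Step 1 (driver shell): compute the increment partition range once, e.g.
--   hive -S -e "select min(ym), max(ym) from zhihu.answer_increment"
-- Step 2: pass the range into the merge script so Hive sees literal constants:
--   hive --hivevar min_increment_ym=201806 --hivevar max_increment_ym=201806 -f merge.hql
-- Step 3: inside merge.hql, filter the big table with the parameters instead of
-- the IN (select distinct ym ...) subquery, so partition pruning kicks in:
select count(*)                                  -- stand-in for the actual merge select
  from zhihu.answer t
 where t.ym >= '${hivevar:min_increment_ym}'
   and t.ym <= '${hivevar:max_increment_ym}';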

Recommended approach:

set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000;

--check compression influence separately.
--it may give some improvement depending on your data entropy
set hive.exec.compress.intermediate=true;
set mapred.output.compress=true;
set hive.exec.compress.output=true;

insert overwrite table zhihu.answer partition(ym)
 select --select increment if exists, old if not exists
  case when i.answer_id is not null then i.col1 else t.col1 end as col1,
  ... --for all columns
  case when i.answer_id is not null then i.coln else t.coln end as coln,
  --partition is the last one
  case when i.answer_id is not null then i.ym else t.ym end as ym
   from zhihu.answer t
        full join zhihu.answer_increment i
        on t.answer_id   = i.answer_id
       and t.insert_time = i.insert_time
       and t.ym=i.ym --check this condition
  where t.ym in (select distinct ym from zhihu.answer_increment) --try to implement min and max parameters instead of this if possible (see 2)
  --alternatively if you do not want to employ shell, check if you can
  --remove the WHERE condition providing ym in the join condition, 
  --this will allow to get rid of the second join in the plan, 
  --though partition pruning will work with parameters better 
;

Finally, the plan will be the most optimal one.

And you may still need to tune the parallelism of mappers and reducers based on your execution logs; see the recommendation in the previous answer.
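
For illustration only (the linked answer covers the details), mapper and reducer parallelism is usually tuned with settings like the ones below; the byte thresholds are example values, not recommendations tuned for this workload:

-- reducers: Hive derives the reducer count from the per-reducer data size
set hive.exec.reducers.bytes.per.reducer=67108864;   -- 64 MB per reducer (example)
set hive.exec.reducers.max=1009;                     -- upper bound on reducers
-- mappers: a smaller max split size means more mappers on the big table scan
set mapreduce.input.fileinputformat.split.maxsize=128000000;
set mapreduce.input.fileinputformat.split.minsize=32000000;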

Another answer:

There are a few things you can do:

  1. Try to use a LEFT SEMI JOIN in your first query.

  2. As your table zhihu.answer_increment is very small in comparison, you can try giving map-side join hints (see the sketch after this list).

  3. If your tables are bucketed on the same columns and the bucket counts are the same, try the SMB (sort-merge-bucket) join approach.

  4. Try setting the properties below:

    1. set hive.exec.compress.intermediate=true;
    2. set hive.exec.parallel=true;
    3. set hive.exec.parallel.thread.number=50;
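
A hedged sketch of points 2 and 3. The MAPJOIN hint and the SMB-related settings are standard Hive options; the query fragment is illustrative rather than a drop-in replacement for the original merge:

-- point 2: explicit map-side join hint on the small table
select /*+ MAPJOIN(b) */ a.*
  from zhihu.answer a
  left join zhihu.answer_increment b
    on a.answer_id = b.answer_id
   and a.insert_time = b.insert_time
 where b.answer_id is null;

-- point 3: settings that allow a sort-merge-bucket (SMB) map join when both
-- tables are bucketed (and sorted) on the same join columns
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;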
