[英]Hive: Query executing from hours
我嘗試在 Azure HDInsight 集群上執行下面的 hive 查詢,但它需要前所未有的時間才能完成。 是否實現了 hive 設置但沒有用。 以下是詳細信息:
Table
CREATE TABLE DB_MYDB.TABLE1(
MSTR_KEY STRING,
SDNT_ID STRING,
CLSS_CD STRING,
BRNCH_CD STRING,
SECT_CD STRING,
GRP_CD STRING,
GRP_NM STRING,
SUBJ_DES STRING,
GRP_DESC STRING,
DTL_DESC STRING,
ACTV_FLAG STRING,
CMP_NM STRING)
STORED AS ORC
TBLPROPERTIES ('ORC.COMPRESS'='SNAPPY');
Hive Query
INSERT OVERWRITE TABLE DB_MYDB.TABLE1
SELECT
CURR.MSTR_KEY,
CURR.SDNT_ID,
CURR.CLSS_CD,
CURR.BRNCH_CD,
CURR.SECT_CD,
CURR.GRP_CD,
CURR.GRP_NM,
CURR.SUBJ_DES,
CURR.GRP_DESC,
CURR.DTL_DESC,
'Y',
CURR.CMP_NM
FROM DB_MYDB.TABLE2 CURR
LEFT OUTER JOIN DB_MYDB.TABLE3 PREV
ON (CURR.SDNT_ID=PREV.SDNT_ID
AND CURR.CLSS_CD=PREV.CLSS_CD
AND CURR.BRNCH_CD=PREV.BRNCH_CD
AND CURR.SECT_CD=PREV.SECT_CD
AND CURR.GRP_CD=PREV.GRP_CD
AND CURR.GRP_NM=PREV.GRP_NM)
WHERE PREV.SDNT_ID IS NULL;
但是查詢運行了幾個小時。 以下是詳細信息:
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 46 46 0 0 0 0
Map 3 .......... SUCCEEDED 169 169 0 0 0 0
Reducer 2 .... RUNNING 1009 825 184 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/03 [======================>>----] 84% ELAPSED TIME: 13622.73 s
--------------------------------------------------------------------------------
我確實設置了一些 hive 屬性
SET hive.execution.engine=tez;
SET hive.tez.container.size=10240;
SET tez.am.resource.memory.mb=10240;
SET tez.task.resource.memory.mb=10240;
SET hive.auto.convert.join.noconditionaltask.size=3470;
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.vectorized.execution.reduce.groupby.enabled=true;
SET hive.cbo.enable=true;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
SET hive.compute.query.using.stats=true;
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.tezfiles = true;
SET hive.merge.size.per.task=268435456;
SET hive.merge.smallfiles.avgsize=16777216;
SET hive.merge.orcfile.stripe.level=true;
Records in Tables:
DB_MYDB.TABLE2= 337319653
DB_MYDB.TABLE3= 1946526625
對查詢似乎沒有任何影響。 誰能幫我:
Using the versions:
Hadoop 2.7.3.2.6.5.3033-1
Hive 1.2.1000.2.6.5.3033-1
Azure HDInsight 3.6
嘗試_1:
正如@leftjoin 所建議的那樣,嘗試設置set hive.exec.reducers.bytes.per.reducer=32000000;
. 這一直有效到 hive 腳本的最后第二個步驟,但最后它失敗了, Caused by: java.io.IOException: Map_1: Shuffle failed with too many fetch failures and insufficient progress!
最后查詢:
INSERT OVERWRITE TABLE DB_MYDB.TABLE3
SELECT
CURR_FULL.MSTR_KEY,
CURR_FULL.SDNT_ID,
CURR_FULL.CLSS_CD,
CURR_FULL.BRNCH_CD,
CURR_FULL.GRP_CD,
CURR_FULL.CHNL_CD,
CURR_FULL.GRP_NM,
CURR_FULL.GRP_DESC,
CURR_FULL.SUBJ_DES,
CURR_FULL.DTL_DESC,
(CASE WHEN CURR_FULL.SDNT_ID = SND_DELTA.SDNT_ID THEN 'Y' ELSE
CURR_FULL.SDNT_ID_FLAG END) AS SDNT_ID_FLAG,
CURR_FULL.CMP_NM
FROM
DB_MYDB.TABLE2 CURR_FULL
LEFT OUTER JOIN DB_MYDB.TABLE1 SND_DELTA
ON (CURR_FULL.SDNT_ID = SND_DELTA.SDNT_ID);
-----------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
-----------------------------------------------------------------
Map 1 ......... RUNNING 1066 1060 6 0 0 0
Map 4 .......... SUCCEEDED 3 3 0 0 0 0
Reducer 2 RUNNING 1009 0 22 987 0 0
Reducer 3 INITED 1 0 0 1 0 0
-----------------------------------------------------------------
VERTICES: 01/04 [================>>--] 99% ELAPSED TIME: 18187.78 s
錯誤:
Caused by: java.io.IOException: Map_1: Shuffle failed with too many fetch failures and insufficient progress,failureCounts=8, pendingInputs=1058, fetcherHealthy=false, reducerProgressedEnough=false, reducerStalled=false
如果您的 fk 列上沒有索引,則應確定添加它們,這是我的建議:
create index idx_TABLE2 on table DB_MYDB.TABLE2 (SDNT_ID,CLSS_CD,BRNCH_CD,SECT_CD,GRP_CD,GRP_NM) AS 'COMPACT' WITH DEFERRED REBUILD;
create index idx_TABLE3 on table DB_MYDB.TABLE3(SDNT_ID,CLSS_CD,BRNCH_CD,SECT_CD,GRP_CD,GRP_NM) AS 'COMPACT' WITH DEFERRED REBUILD;
請注意 hive 3.0 版,索引已從 hive 中刪除,或者您可以使用物化視圖(受 Hive 2.3.0 和以上版本支持)
如果是減速器頂點運行緩慢,您可以通過減少每個減速器配置的字節數來增加減速器並行度。 檢查您當前的設置並相應地減少數字,直到您運行 2 倍或更多減速器:
set hive.exec.reducers.bytes.per.reducer=67108864; --example only, check your current settings
--and reduce accordingly to get twice more reducers on Reducer 2 vertex
更改設置,開始查詢,檢查Reducer 2
頂點上的容器數量,如果容器數量沒有增加,則終止並再次更改。
如果您還想增加映射器的並行度,請閱讀以下答案: https://stackoverflow.com/a/48487306/2700344
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.