Is there a way to optimize the performance of this query in Impala?

Question

The query involves 4 tables and takes 10.5 hours to complete:

Step 1:

create table temp partitioned by (date_pull) stored as parquet as
select <fields>
from trans_ext -- this is the base table
inner join [shuffle] ac  -- fact_acc
inner join [shuffle] c  --related_acc
left join dt --trx_type

Row-count statistics for the tables:

trans_ext: 8,289,244,895 (72 partitions)
ac: 985,164,794 (1 partition)
c: 17,496,531 (1 partition)
dt: 4,369 (1 partition)

Step 2: Create the count table h from temp

select related_cust, count(*) as ct from temp group by related_cust;

Step 3: Create the final table by inner-joining the count table and applying the where clause

select t.* 
from temp t
inner join [shuffle] h on h.related_cust=t.related_cust
where  t.related_cust is not null
and h.ct <=1000000
order by t.related_cust;

I'm wondering how to eliminate the count table and create the final result directly. Final table size: 19.6 billion rows.

Any ideas? Any tips are highly appreciated.

Answer 1

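My first thought would be to remove the order by clause from the last query, the one used to create the final table. That sort is really expensive, and since the data will not be read back sequentially, the ordering adds no value, so you gain nothing from it.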
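As a minimal sketch of that suggestion, here is the Step 3 query from the question with only the order by removed. The table and column names are the asker's. Dropping the is not null predicate is my own assumption: an inner join on h.related_cust = t.related_cust never matches NULL keys, so the filter should be redundant, but keep it if in doubt.

```sql
-- Step 3 without the final sort. Assumes temp and h exist as described in the question.
select t.*
from temp t
inner join [shuffle] h on h.related_cust = t.related_cust
-- "t.related_cust is not null" is omitted: NULL keys never satisfy the equality join.
where h.ct <= 1000000;
```

Running explain on both versions should show whether the sort node disappears from the plan.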

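There are other ways to implement the same query. It would be useful if you could explain the problem you are trying to solve rather than the query you are using to solve it.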
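One such alternative, offered only as a sketch and not as something tested on this data: compute the per-customer count with an analytic function, so that the separate count table h and the self-join are not needed. This assumes an Impala version with analytic function support. The alias x is a name introduced here for illustration.

```sql
-- Count rows per related_cust in the same pass, then filter on that count.
-- Note: the result carries an extra column ct that the original final table did not have.
select x.*
from (
  select t.*,
         count(*) over (partition by t.related_cust) as ct
  from temp t
  where t.related_cust is not null
) x
where x.ct <= 1000000;
```

Whether this is actually faster is not guaranteed, since the analytic function still has to redistribute and sort the rows by related_cust. Compare the explain output and a run on a single partition before committing to it.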

Answer 1 posted 2018-12-11 12:36:10