Hive - Is there a way to further optimize a HiveQL query?

Question

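I have written a query to find the 10 busiest airports in the USA from March to April. It produces the desired output, but I want to try to optimize it further.

Are there any HiveQL-specific optimizations that can be applied to the query? Is GROUPING SETS applicable here? I am new to Hive, and for now this is the shortest query I have come up with.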

SELECT airports.airport, COUNT(Flights.FlightsNum) AS Total_Flights
FROM (
SELECT Origin AS Airport, FlightsNum 
  FROM flights_stats
  WHERE (Cancelled = 0 AND Month IN (3,4))
UNION ALL
SELECT Dest AS Airport, FlightsNum 
  FROM flights_stats
  WHERE (Cancelled = 0 AND Month IN (3,4))
) Flights
INNER JOIN airports ON (Flights.Airport = airports.iata AND airports.country = 'USA')
GROUP BY airports.airport
ORDER BY Total_Flights DESC
LIMIT 10;

The table columns are as follows:

airports

|iata|airport|city|state|country|

flights_stats

|originAirport|destAirport|FlightsNum|Cancelled|Month|

Answer 1

It might help to aggregate before the UNION ALL:

SELECT a.airport, SUM(cnt) AS Total_Flights
FROM ((SELECT Origin AS Airport, COUNT(*) as cnt 
       FROM flights_stats
       WHERE (Cancelled = 0 AND Month IN (3,4))
       GROUP BY Origin
      ) UNION ALL
      (SELECT Dest AS Airport, COUNT(*) as cnt
       FROM flights_stats
       WHERE Cancelled = 0 AND Month IN (3,4)
       GROUP BY Dest
      )
     ) f INNER JOIN
     airports a
     ON f.Airport = a.iata AND a.country = 'USA'
GROUP BY a.airport
ORDER BY Total_Flights DESC
LIMIT 10;

Answer 2

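Filter by airport (inner join) and aggregate before the UNION ALL to reduce the dataset passed to the final aggregation reducer. The UNION ALL subqueries with joins should run in parallel and be faster than joining a larger dataset after the UNION ALL.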

SELECT f.airport, SUM(cnt) AS Total_Flights
FROM (
      SELECT a.airport, COUNT(*) as cnt 
       FROM flights_stats f
            INNER JOIN airports a ON f.Origin=a.iata AND a.country='USA'
       WHERE Cancelled = 0 AND Month IN (3,4)
       GROUP BY a.airport
       UNION ALL
      SELECT a.airport, COUNT(*) as cnt
       FROM flights_stats f
            INNER JOIN airports a ON f.Dest=a.iata AND a.country='USA'
       WHERE Cancelled = 0 AND Month IN (3,4)
       GROUP BY a.airport
     ) f 
GROUP BY f.airport
ORDER BY Total_Flights DESC
LIMIT 10
;

Tune map joins and enable parallel execution:

set hive.exec.parallel=true;
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --size of table to fit in memory

Use Tez and vectorization, and tune mapper and reducer parallelism: https://stackoverflow.com/a/48487306/2700344
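As a rough sketch of what such session settings might look like: the property names below are standard Hive and Tez settings, but the values are illustrative assumptions rather than recommendations, and they are not taken from the linked answer. Vectorization generally only takes effect on columnar formats such as ORC, and the question does not say how flights_stats is stored.

```sql
-- Run on Tez instead of MapReduce (assumes Tez is installed on the cluster).
set hive.execution.engine=tez;

-- Process rows in batches; typically requires a columnar format such as ORC.
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;

-- Mapper parallelism on Tez: smaller split groups mean more mappers.
-- Values are illustrative assumptions, in bytes.
set tez.grouping.min-size=16777216;
set tez.grouping.max-size=134217728;

-- Reducer parallelism: fewer bytes per reducer means more reducers.
-- Value is an illustrative assumption, in bytes.
set hive.exec.reducers.bytes.per.reducer=67108864;
```

Whether any of these help depends on data volume and cluster size, so it is worth timing the query before and after each change.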

Answer 3

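I don't think GROUPING SETS applies here, since you are grouping by only one field.

From the Apache Wiki: "The GROUPING SETS clause in GROUP BY allows us to specify more than one GROUP BY option in the same record set."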
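For contrast, here is a minimal sketch of a case where GROUPING SETS would apply, using the question's flights_stats table. The per-month breakdown is a hypothetical requirement added only for illustration; the original query does not need it.

```sql
-- Two groupings over the same filtered rows in one query:
--   (Origin, Month): departures per airport per month
--   (Origin):        departures per airport across both months
-- Rows from the (Origin) grouping carry NULL in the Month column.
SELECT Origin, Month, COUNT(*) AS cnt
FROM flights_stats
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY Origin, Month
GROUPING SETS ((Origin, Month), (Origin));
```

This is equivalent to a UNION ALL of two separate GROUP BY queries. The original query has only one grouping (by airport), so there is nothing for GROUPING SETS to combine.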

Answer 4

You can test this with joins instead, but in your case the UNION may well be better, so you really need to test it and report back:

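SELECT airports.airport,
       -- An airport with no departures or no arrivals gets NULL from the LEFT JOIN.
       SUM(COALESCE(t1.cnt, 0) + COALESCE(t2.cnt, 0)) AS Total_Flights
FROM airports
-- Departures: aggregated to one row per airport so the joins cannot multiply rows.
LEFT JOIN (SELECT Origin AS Airport, COUNT(FlightsNum) AS cnt
             FROM flights_stats
            WHERE Cancelled = 0 AND Month IN (3,4)
            GROUP BY Origin) t1
  ON t1.Airport = airports.iata
-- Arrivals: same aggregation, keyed on the destination airport.
LEFT JOIN (SELECT Dest AS Airport, COUNT(FlightsNum) AS cnt
             FROM flights_stats
            WHERE Cancelled = 0 AND Month IN (3,4)
            GROUP BY Dest) t2
  ON t2.Airport = airports.iata
WHERE airports.country = 'USA'
GROUP BY airports.airport
ORDER BY Total_Flights DESC
LIMIT 10;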

Answer 1: score 3, posted 2018-03-07 14:08:54
Answer 2: score 3, accepted, posted 2018-03-07 14:33:26
Answer 3: score 2, posted 2018-03-07 20:01:59
Answer 4: score 0, posted 2018-03-07 14:17:49
