Hive - Is there a way to further optimize a HiveQL query?

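I wrote a query to find the 10 busiest airports in the USA for March and April. It produces the desired output, but I would like to optimize it further.

Are there any HiveQL-specific optimizations that can be applied to this query? Is GROUPING SETS applicable here? I'm new to Hive, and this is the shortest query I have come up with so far.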

SELECT airports.airport, COUNT(Flights.FlightsNum) AS Total_Flights
FROM (
SELECT Origin AS Airport, FlightsNum 
  FROM flights_stats
  WHERE (Cancelled = 0 AND Month IN (3,4))
UNION ALL
SELECT Dest AS Airport, FlightsNum 
  FROM flights_stats
  WHERE (Cancelled = 0 AND Month IN (3,4))
) Flights
INNER JOIN airports ON (Flights.Airport = airports.iata AND airports.country = 'USA')
GROUP BY airports.airport
ORDER BY Total_Flights DESC
LIMIT 10;

The table columns are as follows:

airports

|iata|airport|city|state|country|

Flights_stats

|originAirport|destAirport|FlightsNum|Cancelled|Month|

It might help to aggregate before the UNION ALL:

SELECT a.airport, SUM(cnt) AS Total_Flights
FROM ((SELECT Origin AS Airport, COUNT(*) as cnt 
       FROM flights_stats
       WHERE (Cancelled = 0 AND Month IN (3,4))
       GROUP BY Origin
      ) UNION ALL
      (SELECT Dest AS Airport, COUNT(*) as cnt
       FROM flights_stats
       WHERE Cancelled = 0 AND Month IN (3,4)
       GROUP BY Dest
      )
     ) f INNER JOIN
     airports a
     ON f.Airport = a.iata AND a.country = 'USA'
GROUP BY a.airport
ORDER BY Total_Flights DESC
LIMIT 10;
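
One way to check whether the early aggregation actually changes the work Hive does is to look at the query plan with EXPLAIN. The sketch below only inspects one branch of the UNION ALL; the stages and operators you see will depend on your Hive version and execution engine, so treat it as a starting point for comparing the two query shapes rather than a guaranteed result.

```sql
-- Print the plan for one pre-aggregated branch without running it.
-- Look for a map-side Group By Operator ahead of the shuffle.
EXPLAIN
SELECT Origin AS Airport, COUNT(*) AS cnt
FROM flights_stats
WHERE Cancelled = 0 AND Month IN (3,4)
GROUP BY Origin;
```

Prefixing each complete query with EXPLAIN in the same way lets you compare how many rows reach the final aggregation stage.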

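Filter by airport (inner join) and aggregate before the UNION ALL to reduce the dataset passed to the final aggregation reducer. The UNION ALL subqueries, each with its own join, should run in parallel and be faster than joining a larger dataset after the UNION ALL.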

SELECT f.airport, SUM(cnt) AS Total_Flights
FROM (
      SELECT a.airport, COUNT(*) as cnt 
       FROM flights_stats f
            INNER JOIN airports a ON f.Origin=a.iata AND a.country='USA'
       WHERE Cancelled = 0 AND Month IN (3,4)
       GROUP BY a.airport
       UNION ALL
      SELECT a.airport, COUNT(*) as cnt
       FROM flights_stats f
            INNER JOIN airports a ON f.Dest=a.iata AND a.country='USA'
       WHERE Cancelled = 0 AND Month IN (3,4)
       GROUP BY a.airport
     ) f 
GROUP BY f.airport
ORDER BY Total_Flights DESC
LIMIT 10
;

Tune map joins and enable parallel execution:

set hive.exec.parallel=true;
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --size of table to fit in memory

Use Tez and vectorization, and tune mapper and reducer parallelism: https://stackoverflow.com/a/48487306/2700344
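
As a rough sketch of what those settings can look like in a session: the property names below are standard Hive and Tez settings, but the numeric values are illustrative assumptions and should be tuned against your own data volume and cluster. Depending on the Hive version, vectorization may only take effect for certain file formats such as ORC.

```sql
-- Run on Tez instead of MapReduce (assumes Tez is installed on the cluster)
set hive.execution.engine=tez;

-- Process rows in batches rather than one at a time
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;

-- Mapper parallelism on Tez: smaller split groups mean more mappers
-- (values are illustrative assumptions, in bytes)
set tez.grouping.min-size=16777216;
set tez.grouping.max-size=134217728;

-- Reducer parallelism: fewer bytes per reducer means more reducers
-- (value is an illustrative assumption, in bytes)
set hive.exec.reducers.bytes.per.reducer=67108864;
```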

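I don't think GROUPING SETS is applicable here, because you are grouping by only one field.

From the Apache Wiki: "The GROUPING SETS clause in GROUP BY allows us to specify more than one GROUP BY option in the same record set."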
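
To illustrate what that definition means, here is a minimal sketch against the airports table from the question. It is unrelated to the busiest-airports result and only shows the syntax: one pass over the table produces both a per-state count and a per-state, per-city count. The alias airport_cnt is made up for this example.

```sql
-- Two groupings in one query:
--   (state)        -> rows where city is NULL, one count per state
--   (state, city)  -> one count per state and city
SELECT state, city, COUNT(*) AS airport_cnt
FROM airports
WHERE country = 'USA'
GROUP BY state, city
GROUPING SETS ((state), (state, city));
```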

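You could test the following join-based alternative, but in your situation the UNION may well be better, so you really need to test it and report back: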

SELECT airports.airport,
       SUM(COALESCE(t1.cnt, 0) + COALESCE(t2.cnt, 0)) AS Total_Flights
FROM airports
-- departures per origin, aggregated first so the two joins cannot multiply rows
LEFT JOIN (SELECT Origin AS Airport, COUNT(FlightsNum) AS cnt
    FROM flights_stats
   WHERE (Cancelled = 0 AND Month IN (3,4))
   GROUP BY Origin) t1
 ON t1.Airport = airports.iata
-- arrivals per destination
LEFT JOIN (SELECT Dest AS Airport, COUNT(FlightsNum) AS cnt
   FROM flights_stats
   WHERE (Cancelled = 0 AND Month IN (3,4))
   GROUP BY Dest) t2
 ON t2.Airport = airports.iata
WHERE airports.country = 'USA'
GROUP BY airports.airport
ORDER BY Total_Flights DESC
LIMIT 10;
