
How to do Max in count(*) with Hive?

I have two tables:

Fly: Year, Origin

Airport: Code, Name

Here is a sample of the data:

Fly :

1989,SF    
1989,SF   
1989,NY  
1993,NY  
1998,Par     
1998,Par  
1998,NY

Airport:

SF, International Airport    
NY, Inter Air    
Par, Charles de Gaulle

I want to get the most used airport per year.

So first I ran this query to get the number of occurrences of each airport for each year:

SELECT v.Year, a.airport, count(*)
FROM airports a JOIN Vol v ON (a.iata = v.Dest)
GROUP BY v.Year, a.airport
ORDER BY Year ASC, airport ASC;

So I get this kind of result:

1989, San Francisco, 2  
1989, New York, 1
1993, New York, 1
1998, New York, 1
1998, Paris, 2

And I want the max for each year, like this:

1989, San Francisco, 2
1993, New York, 1
1998, Paris, 2

Can I do it with one single query? Should I use an intermediate table?

Is it better with Pig?

Thank you in advance.

This is a little tricky in Hive, but certainly doable. It requires two things: using your first query as a subquery of a bigger one, and a little trick to do an "arg-max".

SELECT Year, max(named_struct('n', n, 'airport', airport)) FROM (
  SELECT v.Year, a.airport, count(*) as n
  FROM airports a JOIN Vol v ON (a.iata = v.Dest)
  GROUP BY v.Year, a.airport
) t
GROUP BY Year;

Notice that named_struct creates a struct value, and structs are compared field by field, starting with the first one. Because n is the first field, max picks the row with the highest count while still retaining the airport name. This does mean that your output will be in the form of a struct, though:

1989, {n:2, airport:San Francisco}
1993, {n:1, airport:New York}
1998, {n:2, airport:Paris}

If you want to "un-struct" it, you just need to select out those fields individually:

SELECT Year,
       max(named_struct('n', n, 'airport', airport)).n,
       max(named_struct('n', n, 'airport', airport)).airport
FROM (
  SELECT v.Year, a.airport, count(*) as n
  FROM airports a JOIN Vol v ON (a.iata = v.Dest)
  GROUP BY v.Year, a.airport
) t
GROUP BY Year;
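
A possible variation, sketched under the assumption that the same airports and Vol tables and columns used in the queries above exist: the struct can be computed once in an inner query and its fields read in an outer one, which avoids repeating the max(named_struct(...)) expression. The aliases best and m are made up for this sketch and are not part of the original answer.

```sql
-- Sketch only: `best` and `m` are illustrative aliases (assumptions).
SELECT Year, best.airport AS airport, best.n AS n
FROM (
  -- Arg-max per year: the struct with the largest n wins.
  SELECT Year, max(named_struct('n', n, 'airport', airport)) AS best
  FROM (
    -- Same per-year, per-airport count as in the question.
    SELECT v.Year, a.airport, count(*) AS n
    FROM airports a JOIN Vol v ON (a.iata = v.Dest)
    GROUP BY v.Year, a.airport
  ) t
  GROUP BY Year
) m;
```

Since structs are compared field by field, two airports with the same n in a year should be ordered by the airport field, so only one row per year comes back even when there is a tie.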
