
sql: group by multiple correlated fields (date, weekday, month)

I am working on a SQL task. The goal is to find out how many flights there are on average for a given day of the week in a given month, using the flights table.

Input table: flights

id              BIGINT
dep_day_of_week varchar (255)
dep_month       varchar (255)
dep_date        text

An example of the flights table. There can be multiple entries for the same date.

id  dep_day_of_week  dep_month   dep_date
1   Thursday         January     4/7/2005 15:24:00
2   Friday           February    5/6/2005 12:12:12
3   Friday           February    5/6/2005 15:12:12

I came across the following solution:

SELECT a.dep_month,
       a.dep_day_of_week,
       AVG(a.flight_count) AS average_flights
  FROM (
        SELECT dep_month, dep_day_of_week, dep_date, 
         COUNT(*) AS flight_count
        FROM flights
        GROUP BY 1,2,3
       ) a
 GROUP BY 1,2
 ORDER BY 1,2;

My question is about the subquery, which calculates the number of flights per day:

SELECT dep_month, dep_day_of_week, dep_date, COUNT(*) AS flight_count
FROM flights
GROUP BY 1,2,3

dep_month, dep_day_of_week, and dep_date are three correlated attributes, with dep_date being the most fine-grained of the three. So I thought GROUP BY 1,2,3 would behave the same as GROUP BY 3.

To see whether there is any difference, I wrapped the subquery in count(*) from ... to count the rows it returns:

Select count(*) from (
    SELECT dep_month, dep_day_of_week, dep_date, COUNT(*) AS flight_count
    FROM flights
    GROUP BY 1,2,3   -- or: GROUP BY 3
)

In the output, the counts for GROUP BY 1,2,3 and GROUP BY 3 are 447 and 441, respectively. Why is there a difference between these two ways of grouping?

Update:

Thanks to @trincot for the excellent answer. I ran the suggested query and found inconsistencies in the input data.

SELECT   dep_date, count(distinct dep_month), count(distinct dep_day_of_week)
FROM     flights
GROUP BY dep_date
HAVING   count(distinct dep_month) > 1
    OR   count(distinct dep_day_of_week) > 1

Output:

dep_date    count(distinct dep_month)   count(distinct dep_day_of_week)
1/16/2001   1   2
10/25/2003  1   2
2/23/2000   1   2
3/29/2001   1   2
4/3/2001    1   2
5/13/2000   1   2

Specifically, the database assigns Monday to 1/16/2001 8:25:00 and Tuesday to 1/16/2001 7:56:00. That is the cause of the inconsistency.

As the date field has a time component, the count(*) in your subquery is going to be 1 almost every time, since records whose times differ fall into separate groups. Your groups are actually per second.
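To see the effect of the time component, here is a minimal sketch using Python's built-in sqlite3 module and a three-row table modeled on the question's example. SQLite is an assumption made only to keep the example self-contained, and the `substr`/`instr` expression is just one way to cut the text at the first space.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights "
    "(id INTEGER, dep_day_of_week TEXT, dep_month TEXT, dep_date TEXT)"
)
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?, ?)",
    [
        (1, "Thursday", "January", "4/7/2005 15:24:00"),
        (2, "Friday", "February", "5/6/2005 12:12:12"),
        (3, "Friday", "February", "5/6/2005 15:12:12"),
    ],
)

# Grouping on the raw text: every distinct timestamp is its own group.
per_timestamp = conn.execute(
    "SELECT dep_date, COUNT(*) FROM flights GROUP BY dep_date ORDER BY dep_date"
).fetchall()

# Grouping on the part before the first space: one group per calendar date.
per_day = conn.execute(
    """
    SELECT substr(dep_date, 1, instr(dep_date, ' ') - 1) AS dep_day,
           COUNT(*)
    FROM flights
    GROUP BY dep_day
    ORDER BY dep_day
    """
).fetchall()

print(per_timestamp)  # three groups, each with a count of 1
print(per_day)        # two groups; 5/6/2005 has a count of 2
```

Only the second query gives a per-day flight count; the first one effectively lists the rows.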

You could get your results without a subquery, like this:

select   dep_month,
         dep_day_of_week,
         count(*) /
             count(distinct substring_index(dep_date, ' ', 1)) avg_flights
from     flights
group by dep_month,
         dep_day_of_week

This counts all the flight records and divides that by the number of distinct dates these flights are on. The date is extracted by taking only the part before the space.
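The same idea can be tried on a small invented data set. The sketch below assumes SQLite (through Python's sqlite3 module) rather than MySQL, so `substring_index` is replaced by `substr`/`instr`, and the count is multiplied by 1.0 because SQLite's `/` truncates when both operands are integers. The rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights "
    "(id INTEGER, dep_day_of_week TEXT, dep_month TEXT, dep_date TEXT)"
)
# Invented sample: 3 flights on one Friday, 2 on another, 1 on a Thursday.
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?, ?)",
    [
        (1, "Friday", "January", "1/7/2005 08:00:00"),
        (2, "Friday", "January", "1/7/2005 09:30:00"),
        (3, "Friday", "January", "1/7/2005 17:45:00"),
        (4, "Friday", "January", "1/14/2005 10:15:00"),
        (5, "Friday", "January", "1/14/2005 21:05:00"),
        (6, "Thursday", "January", "1/6/2005 12:00:00"),
    ],
)

rows = conn.execute(
    """
    SELECT dep_month,
           dep_day_of_week,
           -- total flights divided by the number of distinct dates
           COUNT(*) * 1.0
               / COUNT(DISTINCT substr(dep_date, 1, instr(dep_date, ' ') - 1))
               AS avg_flights
    FROM flights
    GROUP BY dep_month, dep_day_of_week
    ORDER BY dep_month, dep_day_of_week
    """
).fetchall()

for row in rows:
    print(row)
```

For the Fridays, 5 flights spread over 2 distinct dates give an average of 2.5; the single Thursday gives 1.0.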

Note that when you have no record at all for a certain date, that day does not count in the average, which might give a false impression. For instance, if in January there is only one Friday for which you have flights (let's say 10 of them), but there are 4 Fridays in January, you will still get an average of 10, even though 2.5 would be more reasonable.
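The arithmetic behind that caveat can be checked with a short sketch. It uses Python's standard calendar module to count the Fridays in a month; the year 2005 is an assumption picked only because that January has four Fridays, and `weekday_count` is a hypothetical helper name.

```python
import calendar

def weekday_count(year, month, weekday):
    """Number of times `weekday` (0 = Monday) occurs in the given month."""
    return sum(
        1
        for week in calendar.monthcalendar(year, month)
        if week[weekday] != 0  # 0 marks a day outside the month
    )

flights = 10            # all recorded on a single Friday
recorded_fridays = 1    # Fridays that appear in the table
fridays_in_month = weekday_count(2005, 1, calendar.FRIDAY)

avg_over_recorded = flights / recorded_fridays   # what the query reports
avg_over_calendar = flights / fridays_in_month   # average over all Fridays

print(fridays_in_month, avg_over_recorded, avg_over_calendar)
```

Dividing by the calendar count instead of the recorded count would need a calendar table or similar in SQL, which is outside what the query above does.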

About the difference in count

You state that this query returns 447 records:

Select count(*) from (
    SELECT dep_month, dep_day_of_week, dep_date, COUNT(*) AS flight_count
    FROM flights
    GROUP BY 1,2,3)

And this one only 441:

Select count(*) from (
    SELECT dep_month, dep_day_of_week, dep_date, COUNT(*) AS flight_count
    FROM flights
    GROUP BY 3)

This seems to indicate that you have identical dates in multiple records that nevertheless differ in one of the first two columns, which would be an inconsistency. You can find out with this query:

SELECT   dep_date, count(distinct dep_month), count(distinct dep_day_of_week)
FROM     flights
GROUP BY dep_date
HAVING   count(distinct dep_month) > 1
    OR   count(distinct dep_day_of_week) > 1

In a healthy data set, this query should return 0 records. If it returns records, you'll get the dates for which the month or the day of the week is not set correctly in at least one record.
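A small reproduction shows how one mislabeled record produces the gap between the two counts. This is a sketch assuming SQLite through Python's sqlite3 module, which accepts GROUP BY 3 with non-aggregated columns and an unaliased subquery; the three rows are invented, with the same dep_date stored under two different weekdays.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights "
    "(id INTEGER, dep_day_of_week TEXT, dep_month TEXT, dep_date TEXT)"
)
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?, ?)",
    [
        (1, "Tuesday", "January", "1/16/2001"),
        (2, "Monday", "January", "1/16/2001"),    # inconsistent weekday
        (3, "Wednesday", "January", "1/17/2001"),
    ],
)

def count_groups(group_by):
    """Count the groups the subquery produces for a given GROUP BY list."""
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT dep_month, dep_day_of_week, dep_date,
                   COUNT(*) AS flight_count
            FROM flights
            GROUP BY {group_by}
        )
    """
    return conn.execute(sql).fetchone()[0]

groups_123 = count_groups("1,2,3")
groups_3 = count_groups("3")

# The consistency check: dates carrying more than one month or weekday.
bad_dates = conn.execute(
    """
    SELECT dep_date,
           COUNT(DISTINCT dep_month),
           COUNT(DISTINCT dep_day_of_week)
    FROM flights
    GROUP BY dep_date
    HAVING COUNT(DISTINCT dep_month) > 1
        OR COUNT(DISTINCT dep_day_of_week) > 1
    """
).fetchall()

print(groups_123, groups_3, bad_dates)
```

Each inconsistent date adds one extra group under GROUP BY 1,2,3, which matches the question's numbers: six flagged dates account for the difference between 447 and 441.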
