
Aggregate hourly data for each month of the year

I've looked around for something similar, but couldn't find anything. I have an airport data set which looks something like this (I rounded the hours):

Date        Arrival_Time        Departure_Time        ...
2017-01-01  13:00               14:00                 ...
2017-01-01  16:00               17:00                 ...
2017-01-01  17:00               18:00                 ...
2017-01-01  11:00               12:00                 ...

The problem is that for some months there isn't a flight at a specific time, which means I have missing data for some hours. How can I extract hourly arrivals for each hour of every month so that there are no missing values?

I've tried using dplyr and doing the following:

arrivals <- allFlights %>% group_by(month(Date), Arrival_Time) %>%
                            summarise(n()) %>%
                            na.omit()

but the problem clearly arises because group_by cannot fill in my missing data. I end up with data for every month, but no entries for some hours (e.g. no entry for month 1, hour 22:00).

I could currently get my answer by filtering out every month in its own list, and then fully merging them with a complete list of hours, but that's really slow as I have to do this 12 times. Ideally I'm trying to end up with something like this:

Hour    Month    January    February    March   ...   December
00:00     1        ###        ###        ###     ...    ###
01:00     1        ###        ###        ###     ...    ###
 ...
00:00     12       ###        ###        ###     ...    ###
23:00     12       ###        ###        ###     ...    ###

where ### is the number of flights for that hour of that month. Is there a nice way of doing this?

Note: I was thinking that if I could somehow join every month's hours with my complete list of hours and replace all NAs with 0s, that would work, but I couldn't figure out how to do it properly.

Hopefully the question makes sense. I'd gladly clarify if anything is unclear.

EDIT: If you want to try it with the nycflights13 package, you could reproduce my attempt with the following code:

  allFlights <- nycflights13::flights

  allFlights$arr_time <- format(strptime(substr(as.POSIXct(sprintf("%04.0f", allFlights$arr_time), format="%H%M"), 12, 16), '%H:%M'), '%H:00')

  arrivals <- allFlights %>% filter(carrier == "MQ") %>% group_by(month, arr_time) %>% summarise(n()) %>% na.omit()

Notice how arrivals doesn't have anything for month 1, hour 02:00, 03:00, etc. What I'm trying to do is have this be a complete data set with the missing hours filled in as 0.

Is this what you're trying to do? I'm not sure if I'm aggregating exactly how you want, but the !is.na should do what you're looking for.

library(dplyr)
library(lubridate)

# Count the non-missing arrival times per month and hour
arrivals <- allFlights %>% group_by(month = month(Date), Arrival_Time) %>%
            summarise(n = sum(!is.na(Arrival_Time)))

Edit: I may not be clear. Do you want a zero to show for hours where there are no data?
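
If zeros for the empty hours are what's wanted, one base R option is to count with factors, since table() reports 0 for any level that never occurs. This is only a sketch on made-up toy values (the month and arr_time vectors and the all_hours name below are assumptions for illustration, not the real data):

```r
# Toy stand-in for the flight data (hypothetical values)
month    <- c(1L, 1L, 1L, 2L)
arr_time <- c("13:00", "13:00", "16:00", "11:00")

# All 24 hour labels, "00:00" through "23:00"
all_hours <- sprintf("%02d:00", 0:23)

# Factors carry their full set of levels, so table() returns a count
# (possibly 0) for every hour/month combination
counts <- table(
  hour  = factor(arr_time, levels = all_hours),
  month = factor(month, levels = 1:12)
)

counts["13:00", "1"]  # two toy flights in month 1 at 13:00
counts["22:00", "1"]  # no toy flights, so the count is 0
```

The result has one row per hour and one column per month; as.data.frame(counts) would turn it into long format if that is easier to work with.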

So I'm circling back to this. There's a handy package called padr that will "pad" the date/time entries with NAs for missing values. Because there is a time_hour field, you can use pad.

library(padr)
allFlightsPad <- allFlights %>% pad

Then you can summarize from there. See the padr documentation for more info.

I think you can use the code below to generate what you need.

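library(dplyr)    # for left_join
library(stringr)  # for str_pad

# Every hour/month combination: "00:00" ... "23:00" for each month in the data
dim_month_hour <- expand.grid(
  hour = paste(str_pad(seq(0, 23, 1), 2, "left", "0"), "00", sep = ":"),
  month = sort(unique(allFlights$month)),
  stringsAsFactors = FALSE
)

# Keep all combinations; hours with no flights get NA in the count column
arrivals_full <- left_join(dim_month_hour, arrivals, by = c("hour" = "arr_time", "month" = "month"))
# Replace the missing counts with 0
arrivals_full[is.na(arrivals_full$`n()`), "n()"] <- 0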
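
As an alternative to building the grid and joining by hand, tidyr's complete() can add the missing month/hour rows and fill their counts in one step. The sketch below runs on a small made-up data frame (flights_small and all_hours are hypothetical names, and the values are toy data), so treat it as an illustration rather than a drop-in replacement:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the flight data (hypothetical values), using the same
# column names as the nycflights13 example
flights_small <- data.frame(
  month    = c(1L, 1L, 1L, 2L),
  arr_time = c("13:00", "13:00", "16:00", "11:00"),
  stringsAsFactors = FALSE
)

# All 24 hour labels, "00:00" through "23:00"
all_hours <- sprintf("%02d:00", 0:23)

arrivals_full <- flights_small %>%
  count(month, arr_time, name = "n") %>%   # flights per month and hour
  complete(month, arr_time = all_hours,    # add the missing month/hour rows
           fill = list(n = 0L))            # and set their count to 0

# Optional: one row per hour, one column per month
arrivals_wide <- arrivals_full %>%
  pivot_wider(names_from = month, values_from = n, names_prefix = "month_")
```

Note that complete() only expands over the months that actually appear in the data; to force all twelve, month = 1:12 could be passed the same way all_hours is passed for arr_time.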
