Aggregate hourly data for each month of the year

Question

I've looked around for something similar, but couldn't find anything. I have an airport data set which looks something like this (I rounded the hours):

Date        Arrival_Time        Departure_Time        ...
2017-01-01  13:00               14:00                 ...
2017-01-01  16:00               17:00                 ...
2017-01-01  17:00               18:00                 ...
2017-01-01  11:00               12:00                 ...

The problem is that in some months there are no flights at certain hours, so those hour/month combinations are missing from my data entirely. How can I count arrivals for each hour of every month so that no hour is missing?

I've tried using dplyr and doing the following:

arrivals <- allFlights %>% group_by(month(Date), Arrival_Time) %>%
                            summarise(n()) %>%
                            na.omit()

but the problem clearly arises because group_by cannot fill in my missing data. I end up with data for every month, but no entries for some hours (e.g. no entry for month 1, hour 22:00).

I could currently get my answer by filtering out every month in its own list, and then fully merging them with a complete list of hours, but that's really slow as I have to do this 12 times. Ideally I'm trying to end up with something like this:

Hour    Month    January    February    March   ...   December
00:00     1        ###        ###        ###     ...    ###
01:00     1        ###        ###        ###     ...    ###
 ...
00:00     12       ###        ###        ###     ...    ###
23:00     12       ###        ###        ###     ...    ###

where ### is the number of flights for that hour of that month. Is there a nice way of doing this?

Note: I was thinking that if I could somehow join every month's hours with my complete list of hours and replace all NAs with 0s, that would work, but I couldn't figure out how to do it properly.

Hopefully the question makes sense. I'd gladly clarify if anything is unclear.

EDIT: If you want to try it with the nycflights13 package, you could reproduce my attempt with the following code:

  allFlights <- nycflights13::flights

  allFlights$arr_time <- format(strptime(substr(as.POSIXct(sprintf("%04.0f", allFlights$arr_time), format="%H%M"), 12, 16), '%H:%M'), '%H:00')

  arrivals <- allFlights %>% filter(carrier == "MQ") %>% group_by(month, arr_time) %>% summarise(n()) %>% na.omit()

Notice how arrivals doesn't have anything for month 1, hour 02:00, 03:00, etc. What I'm trying to do is have this be a complete data set with the missing hours filled in as 0.

Answer 1

Is this what you're trying to do? I'm not sure if I'm aggregating exactly how you want, but the !is.na should do what you're looking for.

arrivals <- allFlights %>% group_by(month(Date), Arrival_Time) %>%
            rowwise() %>%
            summarise(month = plyr::count(!is.na(Arrival_Time)))

Edit: I may not be clear. Do you want a zero to show for hours where there are no data?

So I'm circling it. There's a cool package called padr that will "pad" the date/time entries with NAs for missing values. Because there is a time_hour field, you can use pad.

library(padr)
allFlightsPad <- allFlights %>% pad

Then you can summarize from there. See this page for info.
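As a rough sketch of what the "summarize, then show zeros" step could look like, here is a version that does not use padr at all. It swaps in tidyr::complete() instead, and it runs on a small made-up data frame (toy, with hypothetical values) rather than on allFlights, so treat the column names and numbers as assumptions.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for allFlights (hypothetical values, hours already rounded)
toy <- tibble(
  month    = c(1L, 1L, 1L, 2L, 2L),
  arr_time = c("13:00", "13:00", "16:00", "02:00", "13:00")
)

# All 24 hour labels, "00:00" to "23:00"
hours <- sprintf("%02d:00", 0:23)

arrivals_complete <- toy %>%
  count(month, arr_time, name = "n") %>%   # flights per month/hour that occur
  complete(month = 1:12, arr_time = hours, # add every missing month/hour pair
           fill = list(n = 0L))            # and give those pairs a count of 0
```

The result has one row for each of the 12 x 24 month/hour pairs, with n = 0 wherever the toy data has no flights.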

Answer 2

I think you can use the code below to generate what you need.

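library(dplyr)    # for left_join
library(stringr)

# Every hour ("00:00" to "23:00") crossed with every month in the data
dim_month_hour <- expand.grid(
  hour = paste(str_pad(0:23, 2, "left", "0"), "00", sep = ":"),
  month = sort(unique(allFlights$month)),
  stringsAsFactors = FALSE
)

# Keep every hour/month pair, then set the NA counts (no flights) to 0
arrivals_full <- left_join(dim_month_hour, arrivals, by = c("hour" = "arr_time", "month" = "month"))
arrivals_full[is.na(arrivals_full$`n()`), "n()"] <- 0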
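arrivals_full is in long format (one row per hour/month pair), while the question sketches a layout with hours as rows and months as columns. A minimal base R sketch of that reshaping step might look like the following. It uses a small hypothetical table named long with a count column called n (not `n()`), so the names and values are assumptions rather than the real flight data.

```r
# Toy stand-in for a zero-filled long table: one row per hour/month pair
hours <- sprintf("%02d:00", 0:23)
long <- expand.grid(hour = hours, month = 1:12, stringsAsFactors = FALSE)
long$n <- 0
long$n[long$hour == "13:00" & long$month == 1] <- 2  # hypothetical counts
long$n[long$hour == "02:00" & long$month == 2] <- 1

# Cross-tabulate: rows are hours, columns are months, cells are summed counts
wide <- xtabs(n ~ hour + month, data = long)

# Turn the month numbers used as column labels into month names
colnames(wide) <- month.name[as.integer(colnames(wide))]

# Plain data frame with hours as row names and one column per month
wide_df <- as.data.frame.matrix(wide)
```

With the real data, the same xtabs() call should work on arrivals_full after renaming the `n()` column to something that is easier to use in a formula.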

2 answers

solution1
0 2018-03-09 22:18:27

solution2
0 ACCEPTED 2018-03-10 01:09:10
