How to group_by values and get the count for multiple attributes in dataframe using R

I have a dataframe in the format below. This is sample data, but I have thousands of records in the same format:

   ORIGIN    DEST    CARRIER_DELAY    WEATHER_DELAY    NAS_DELAY

   JFK       MCO     1                0                47
   JFK       LAX
   JFK       MCO     1                2                30
   LOG       DFW     12               20               3
   LOG       DFW

I need to group by origin and destination and calculate the number of occurrences (count) of each delay using dplyr. The values in the delay columns are in minutes. Each value greater than 0 should increase the count by 1. Some rows contain null values, and those should be ignored.

The output should look like below:

   ORIGIN    DEST  CARR_DELAY_COUNT WEATHER_DELAY_COUNT  NAS_DELAY_COUNT 

   JFK       MCO      2                1                   2
   LOG       DFW      1                1                   1

I am using the dplyr code below:

   delays.df <- flight.df %>%
     group_by(ORIGIN, DEST) %>%
     # total delay minutes per category, ignoring missing values
     summarize(carr_delay = sum(CARRIER_DELAY, na.rm = TRUE),
               weather_delay = sum(WEATHER_DELAY, na.rm = TRUE),
               nas_delay = sum(NAS_DELAY, na.rm = TRUE)) %>%
     ungroup()

The code above produces the sum of the delay minutes for each delay category, grouped by origin and destination.

How can I add another function to get the count of each delay in addition to the sum?

You can use summarize_each after group_by from the dplyr package. You'll have to rename the columns afterwards, though.

library(dplyr)
df %>% group_by(ORIGIN, DEST) %>% summarize_each(funs(Count = sum(.>0, na.rm=T)))

Source: local data frame [3 x 5]
Groups: ORIGIN [?]

  ORIGIN   DEST CARRIER_DELAY WEATHER_DELAY NAS_DELAY
  (fctr) (fctr)         (int)         (int)     (int)
1    JFK    LAX             0             0         0
2    JFK    MCO             2             1         2
3    LOG    DFW             1             1         1
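
summarize_each and funs have since been deprecated in dplyr. Below is a minimal sketch of the same idea with across(), assuming dplyr 1.0.0 or later and assuming the blank cells in the sample data are NA. Passing a named list returns the sum and the count together, and the .names pattern takes care of the renaming. The SUM and COUNT suffixes are illustrative choices, not something from the question.

```r
library(dplyr)

# Sample data from the question; blank cells are assumed to be NA.
flight.df <- data.frame(
  ORIGIN        = c("JFK", "JFK", "JFK", "LOG", "LOG"),
  DEST          = c("MCO", "LAX", "MCO", "DFW", "DFW"),
  CARRIER_DELAY = c(1, NA, 1, 12, NA),
  WEATHER_DELAY = c(0, NA, 2, 20, NA),
  NAS_DELAY     = c(47, NA, 30, 3, NA)
)

delays.df <- flight.df %>%
  group_by(ORIGIN, DEST) %>%
  summarise(
    across(
      ends_with("_DELAY"),
      list(
        SUM   = ~ sum(.x, na.rm = TRUE),      # total delay minutes
        COUNT = ~ sum(.x > 0, na.rm = TRUE)   # number of delays longer than 0 minutes
      ),
      .names = "{.col}_{.fn}"                 # e.g. CARRIER_DELAY_COUNT
    ),
    .groups = "drop"                          # return an ungrouped result
  )

print(delays.df)
```

The JFK-LAX group is kept and gets 0 in every column, because its only row is all NA and na.rm = TRUE leaves nothing to sum or count.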

We can use data.table

library(data.table)
setDT(df)[, lapply(.SD, function(x) sum(x > 0, na.rm=TRUE)) , .(ORIGIN, DEST)]
#   ORIGIN DEST CARRIER_DELAY WEATHER_DELAY NAS_DELAY
#1:    JFK  MCO             2             1         2
#2:    JFK  LAX             0             0         0
#3:    LOG  DFW             1             1         1

NOTE: This straightforward method produces the same correct output as the accepted answer.

It is also straightforward to calculate this using the base R function aggregate.

aggregate(cbind("CARRIER_DELAY"=CARRIER_DELAY,
                "WEATHER_DELAY"=WEATHER_DELAY,
                "NAS_DELAY"=NAS_DELAY) ~ ORIGIN + DEST,
          data=df, FUN=function(x) sum(x > 0, na.rm=TRUE))

which returns

  ORIGIN DEST CARRIER_DELAY WEATHER_DELAY NAS_DELAY
1    LOG  DFW             1             1         1
2    JFK  MCO             2             1         2

I use cbind to group the summary variables together and to also give names to the output.
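
The JFK-LAX group is missing from the output above because the formula interface of aggregate drops rows containing NA before grouping (its default is na.action = na.omit). If that group should be reported with zero counts, one option is to pass na.action = na.pass and let na.rm = TRUE handle the missing values inside FUN. A minimal sketch, assuming the blank cells in the sample data are NA:

```r
# Sample data from the question; blank cells are assumed to be NA.
df <- data.frame(
  ORIGIN        = c("JFK", "JFK", "JFK", "LOG", "LOG"),
  DEST          = c("MCO", "LAX", "MCO", "DFW", "DFW"),
  CARRIER_DELAY = c(1, NA, 1, 12, NA),
  WEATHER_DELAY = c(0, NA, 2, 20, NA),
  NAS_DELAY     = c(47, NA, 30, 3, NA)
)

# na.pass keeps rows with NA, so all-NA groups survive and count as 0.
counts <- aggregate(cbind(CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY) ~ ORIGIN + DEST,
                    data = df,
                    FUN = function(x) sum(x > 0, na.rm = TRUE),
                    na.action = na.pass)

print(counts)
```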
