How to aggregate multiple columns of a dataframe with dplyr

I have a dataframe with the columns id, category, cost and colour.

Here is the dataframe df:

library(dplyr)

id <- c(1, 1, 1, 2, 2, 3, 1)
category <- c("V", "V", "V", "W", "W", "W", "W")
cost <- c(10, 15, 5, 2, 14, 20, 3)
colour <- c("red", "green", "red", "green", "blue","blue","blue")

df <- data.frame(id, category, cost, colour)
df$category <- as.character(df$category)

df
id    category    cost     colour
1     V           10       red
1     V           15       green
1     V           5        red
2     W           2        green
2     W           14       blue
3     W           20       blue
1     W           3        blue

Here is the structure of df:

'data.frame':   7 obs. of  4 variables:
 $ id       : num  1 1 1 2 2 3 1
 $ category : chr  "V" "V" "V" "W" ...
 $ cost     : num  10 15 5 2 14 20 3
 $ colour   : Factor w/ 3 levels "blue","green",..: 3 2 3 2 1 1 1

I would like to build a new dataframe df_new that contains, for each id: the frequency (freq); the number of rows where category equals W (category_W) and where it equals V (category_V); the total cost of the rows where category is W (cost_W) and where it is V (cost_V); and the number of rows for each colour (col_red, col_green, col_blue). The output should look like this:

id   freq   category_W   category_V   cost_W   cost_V   col_red   col_green   col_blue
1    4      1            3            3        30       2         1           1
2    2      2                         16                          1           1
3    1      1                         20                                      1

I tried the following, but it doesn't work:

df_new <- group_by(df, id) %>%
  summarize(
    freq = count(id),
    category_W = count(category == "W", na.rm = TRUE),
    category_V = count(category == "V", na.rm = TRUE),
    col_red = count(colour == "red", na.rm = TRUE),
    col_green = count(colour == "green", na.rm = TRUE),
    col_blue = count(colour == "blue", na.rm = TRUE)
  )

I have no clue how to add the conditions for cost_W and cost_V. I get the error: length(rows) == 1 is not TRUE. Thanks a lot in advance!

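Well, you are almost there. The problem is count(): it expects a data frame, not a vector inside summarize(). Use n() for the group size and sum() for the conditional counts.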

You can take advantage of the fact that logical values are converted to numbers in arithmetic operations: FALSE becomes 0 and TRUE becomes 1. So when you sum a logical vector, you get the number of elements for which the condition is TRUE.
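As a small illustration of that coercion, here is a base R sketch using the category values from the question. The variable name is_w is made up for this example.

```r
# category values copied from the question's data
category <- c("V", "V", "V", "W", "W", "W", "W")

# comparing against "W" gives a logical vector
is_w <- category == "W"

# sum() treats TRUE as 1 and FALSE as 0, so this counts the W entries
n_w <- sum(is_w)
n_w  # 4
```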

You can use the same property to calculate the cost: just multiply the logical condition by the cost variable. Where the category matches, the cost is kept and summed; otherwise it is reduced to 0.
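To make the multiplication concrete, here is a sketch restricted to the four rows with id 1 from the question. The names cost_w, cost_v and cost_w_alt are only illustrative.

```r
# rows of the question's data where id == 1
category <- c("V", "V", "V", "W")
cost     <- c(10, 15, 5, 3)

# (category == "W") is 0/1 after coercion, so non-matching costs become 0
cost_w <- sum((category == "W") * cost)  # 3
cost_v <- sum((category == "V") * cost)  # 30

# logical subsetting is an alternative that gives the same total on this data
cost_w_alt <- sum(cost[category == "W"])
```

Both forms agree with the cost_W and cost_V values expected for id 1 in the question.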

df_new <- group_by(df, id) %>%
  summarize(
    # number of rows per id
    freq = n(),
    # conditional counts: TRUE is summed as 1
    category_W = sum(category == "W", na.rm = TRUE),
    category_V = sum(category == "V", na.rm = TRUE),
    # conditional totals: non-matching rows contribute 0
    cost_W = sum((category == "W") * cost, na.rm = TRUE),
    cost_V = sum((category == "V") * cost, na.rm = TRUE),
    col_red = sum(colour == "red", na.rm = TRUE),
    col_green = sum(colour == "green", na.rm = TRUE),
    col_blue = sum(colour == "blue", na.rm = TRUE)
  )
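Note that with this approach, combinations that do not occur come out as 0 rather than as the blank cells shown in the desired output. The sketch below rebuilds the question's data and keeps only three of the summary columns to show this. The name check is made up for the example, and it assumes dplyr is installed.

```r
library(dplyr)

# data copied from the question
df <- data.frame(
  id       = c(1, 1, 1, 2, 2, 3, 1),
  category = c("V", "V", "V", "W", "W", "W", "W"),
  cost     = c(10, 15, 5, 2, 14, 20, 3)
)

# trimmed version of the summary: group size and the two conditional totals
check <- group_by(df, id) %>%
  summarize(
    freq   = n(),
    cost_W = sum((category == "W") * cost, na.rm = TRUE),
    cost_V = sum((category == "V") * cost, na.rm = TRUE)
  )

# ids 2 and 3 have no V rows, so their cost_V is 0 rather than empty
as.data.frame(check)
```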
