
How can I conditionally summarize different columns of data succinctly with dplyr in R?

I need to aggregate data in R. I have 8 columns: 3 categorical and 5 numeric. The numeric columns need to be summed conditionally, based on combinations of conditions from 2 of the categorical variables. My data looks like this:

df <- structure(list(Color = c("Red", "Blue", "Blue", "Red", "Yellow"
), Weekend = c(1L, 0L, 1L, 0L, 1L), LeapYear = c(1L, 1L, 0L, 
0L, 0L), Length = c(15L, 20L, 10L, 15L, 15L), Height = c(50L, 
70L, 35L, 28L, 80L), Weight = c(120L, 130L, 120L, 105L, 140L), 
    Cost = c(25L, 50L, 55L, 65L, 80L), Purchases = c(5L, 10L, 
    5L, 10L, 15L)), class = "data.frame", row.names = c(NA, -5L
))

> df
   Color Weekend LeapYear Length Height Weight Cost Purchases
1    Red       1        1     15     50    120   25         5
2   Blue       0        1     20     70    130   50        10
3   Blue       1        0     10     35    120   55         5
4    Red       0        0     15     28    105   65        10
5 Yellow       1        0     15     80    140   80        15

I want to aggregate this table with conditional sums. For example: sum Length and Height, but only for leap years; sum Height and Cost, but only for rows that are both leap years and weekends.

I want these conditional sums grouped by Color, so that the result looks like this:

 Color Length Height Weight Cost Purchases Length_LeapYear Height_LeapYear Height_LeapYear_Weekend Cost_LeapYear_Weekend Purchases_Weekend
   Red     30     78    225   90        15              15              50                      50                    25                 5
  Blue     30    105    250  105        15              20              70                       0                     0                 5
Yellow     15     80    140   80        15               0               0                       0                     0                15

I am working in dplyr, and the following works for summing multiple fields on the same condition using summarise_at():

df %>% 
  group_by(Color, Weekend, LeapYear) %>% 
  summarise_at(c(Length_LeapYear = "Length", Height_LeapYear = "Height"), ~sum(.[LeapYear == 1]))

But when I add further calls for my remaining conditionally summed variables, each call discards the previous summaries. Here is how I imagine the code working:

df %>% 
  group_by(Color, Weekend, LeapYear) %>% 
  summarise_at(c("Length", "Height", "Weight", "Cost", "Purchases"), sum) %>%
  summarise_at(c(Length_LeapYear = "Length", Height_LeapYear = "Height"), ~sum(.[LeapYear == 1])) %>%
  summarise_at(c(Height_LeapYear_Weekend = "Height", Cost_LeapYear_Weekend = "Cost"), ~sum(.[LeapYear == 1 & Weekend == 1])) %>%
  summarise(Purchases_Weekend = sum(Purchases)) %>%
  group_by(Color)

Ultimately, I feel there must be a way to get each of these differently conditioned sums into one call of summarise_at(). I am also unsure of the best practice for summing conditionally on columns (Weekend and LeapYear) and then omitting those columns from the final table, so help with that would be appreciated as well.

For the record, I do know that I can perform these manipulations with one long call to summarise(), where I individually condition each derived column. However, in practice, my dataset is a lot wider than this, and it just makes more sense to try to condense the data manipulation by grouping like conditions.

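UPDATE: On second thought, I understand that you want to do it all at once. I think the syntax below summarises the whole dataset by four types of aggregation in a single pipeline. Note that across(3:7, ...) selects Length through Purchases here: positions are counted among the non-grouping columns, so with Color set aside, Weekend and LeapYear are columns 1 and 2.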

library(dplyr)

# by = "Color" makes the join key explicit instead of relying on the default
df %>% group_by(Color) %>%
  summarise(across(3:7, ~sum(.))) %>%
  left_join(df %>% group_by(Color) %>% summarise(across(3:7, ~sum(. * LeapYear), .names = "{.col}_LeapYear")), by = "Color") %>%
  left_join(df %>% group_by(Color) %>% summarise(across(3:7, ~sum(. * Weekend), .names = "{.col}_Weekend")), by = "Color") %>%
  left_join(df %>% group_by(Color) %>% summarise(across(3:7, ~sum(. * LeapYear * Weekend), .names = "{.col}_LeapYear_Weekend")), by = "Color")

# A tibble: 3 x 21
  Color Length Height Weight  Cost Purchases Length_LeapYear Height_LeapYear Weight_LeapYear Cost_LeapYear
  <chr>  <int>  <int>  <int> <int>     <int>           <int>           <int>           <int>         <int>
1 Blue      30    105    250   105        15              20              70             130            50
2 Red       30     78    225    90        15              15              50             120            25
3 Yell~     15     80    140    80        15               0               0               0             0
# ... with 11 more variables: Purchases_LeapYear <int>, Length_Weekend <int>, Height_Weekend <int>,
#   Weight_Weekend <int>, Cost_Weekend <int>, Purchases_Weekend <int>, Length_LeapYear_Weekend <int>,
#   Height_LeapYear_Weekend <int>, Weight_LeapYear_Weekend <int>, Cost_LeapYear_Weekend <int>,
#   Purchases_LeapYear_Weekend <int>
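
The `. * LeapYear` products above rely on the flags being coded 0/1: multiplying zeroes out the rows where the flag is 0, so the sum matches an explicit subset. A small base R check using the Length, Weekend and LeapYear values from the sample data (the variable names here are only for illustration):

```r
# Values copied from the sample data; names are illustrative only
len       <- c(15L, 20L, 10L, 15L, 15L)
leap_year <- c(1L, 1L, 0L, 0L, 0L)
weekend   <- c(1L, 0L, 1L, 0L, 1L)

# Multiplying by a 0/1 flag drops the rows where the flag is 0 ...
by_product <- sum(len * leap_year)
# ... which gives the same total as subsetting on the condition
by_subset  <- sum(len[leap_year == 1])

# Two flags multiplied together behave like a logical AND
both_product <- sum(len * leap_year * weekend)
both_subset  <- sum(len[leap_year == 1 & weekend == 1])
```

If a flag were coded any other way (for example 1/2, or as text), the product form would no longer be a valid filter and the subset form would be the safer choice.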

You can also pass several functions to across() in a named list, which shortens the code further:

df %>% group_by(Color) %>%
  summarise(across(3:7, list(sum= ~sum(.), 
                             leapyear = ~sum(.*LeapYear), 
                             weekend = ~sum(.*Weekend), 
                             leapyear_weekend = ~sum(.*Weekend*LeapYear))))

# A tibble: 3 x 21
  Color Length_sum Length_leapyear Length_weekend Length_leapyear~ Height_sum Height_leapyear Height_weekend
  <chr>      <int>           <int>          <int>            <int>      <int>           <int>          <int>
1 Blue          30              20             10                0        105              70             35
2 Red           30              15             15               15         78              50             50
3 Yell~         15               0             15                0         80               0             80
# ... with 13 more variables: Height_leapyear_weekend <int>, Weight_sum <int>, Weight_leapyear <int>,
#   Weight_weekend <int>, Weight_leapyear_weekend <int>, Cost_sum <int>, Cost_leapyear <int>,
#   Cost_weekend <int>, Cost_leapyear_weekend <int>, Purchases_sum <int>, Purchases_leapyear <int>,
#   Purchases_weekend <int>, Purchases_leapyear_weekend <int>
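
Both variants above compute every aggregation for every numeric column, which is more than the question's target table. As a rough sketch, not part of the original answer and assuming dplyr >= 1.0.0, several across() calls inside one summarise() can group columns by condition. The plain totals are placed last on the assumption that summarise() lets later expressions see earlier results: once Length has been overwritten by its total, a later `Length[LeapYear == 1]` would no longer see the raw rows.

```r
library(dplyr)

df <- data.frame(
  Color     = c("Red", "Blue", "Blue", "Red", "Yellow"),
  Weekend   = c(1L, 0L, 1L, 0L, 1L),
  LeapYear  = c(1L, 1L, 0L, 0L, 0L),
  Length    = c(15L, 20L, 10L, 15L, 15L),
  Height    = c(50L, 70L, 35L, 28L, 80L),
  Weight    = c(120L, 130L, 120L, 105L, 140L),
  Cost      = c(25L, 50L, 55L, 65L, 80L),
  Purchases = c(5L, 10L, 5L, 10L, 15L)
)

res <- df %>%
  group_by(Color) %>%
  summarise(
    # One across() per condition; .names builds the suffixed column names
    across(c(Length, Height), ~ sum(.x[LeapYear == 1]),
           .names = "{.col}_LeapYear"),
    across(c(Height, Cost), ~ sum(.x[LeapYear == 1 & Weekend == 1]),
           .names = "{.col}_LeapYear_Weekend"),
    across(Purchases, ~ sum(.x[Weekend == 1]),
           .names = "{.col}_Weekend"),
    # Unconditional totals last, because they overwrite the raw columns
    across(c(Length, Height, Weight, Cost, Purchases), sum)
  ) %>%
  # Move the totals next to Color to match the layout in the question
  relocate(Length:Purchases, .after = Color)
```

Weekend and LeapYear are neither grouping variables nor summarised, so they drop out of the result without an explicit select().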

I have added a sample dput(df) to your question.

OLD ANSWER: Do it like this:

df %>% 
  group_by(Color) %>% 
  summarise(Length_s = sum(Length),
            Height_s = sum(Height),
            Weight_s = sum(Weight),
            Cost_s = sum(Cost),
            Purchases_s = sum(Purchases),
            Length_Leap_year = sum(Length * LeapYear),
            Height_Leap_year = sum(Height * LeapYear),
            Height_Leap_year_Weekend = sum(Height * LeapYear * Weekend),
            Purchases_Weekend = sum(Purchases * Weekend))

# A tibble: 3 x 10
  Color  Length_s Height_s Weight_s Cost_s Purchases_s Length_Leap_year Height_Leap_year Height_Leap_year_Weekend Purchases_Weeke~
  <chr>     <int>    <int>    <int>  <int>       <int>            <int>            <int>                    <int>            <int>
1 Blue         30      105      250    105          15               20               70                        0                5
2 Red          30       78      225     90          15               15               50                       50                5
3 Yellow       15       80      140     80          15                0                0                        0               15
