简体   繁体   中英

Keeping group_by in tact while applying a filter within mutate in dplyr

I am trying to apply a filter within a mutate, but I haven't figured out the right way to apply a filter while keeping the data frame grouping in tact.

Here is a simple reproducible example:

# Sample data
my_dates = seq(as.Date("2020/1/1"), by = "month", length.out = 6) 
grp      = c(rep("A",3), rep("B", 3))
x        = c(2,4,6,8,10,12)

my_df <- data.frame(my_dates, grp, x)

    my_dates grp  x
1 2020-01-01   A  2
2 2020-02-01   A  4
3 2020-03-01   A  6
4 2020-04-01   B  8
5 2020-05-01   B 10
6 2020-06-01   B 12


# Pick a max date for which the data will be filtered
max_date <- "2020-05-01"


# Try to get the average by group, after filtering out the max date included
filt_data  <- my_df %>% 
  group_by(grp) %>% 
  mutate(included_data = my_dates < max_date,
         my_mean       = mean(filter(., my_dates < max_date)$x)
  )


# A tibble: 6 x 5
# Groups:   grp [2]
  my_dates   grp       x included_data my_mean
  <date>     <fct> <dbl> <lgl>           <dbl>
1 2020-01-01 A         2 TRUE                5
2 2020-02-01 A         4 TRUE                5
3 2020-03-01 A         6 TRUE                5
4 2020-04-01 B         8 TRUE                5
5 2020-05-01 B        10 FALSE               5
6 2020-06-01 B        12 FALSE               5

The output I was hoping to get was this, where the mean of included data for Group A = mean (2,4,6) = 4 and the mean of the included data for Group B = mean(8) = 8:

  my_dates   grp       x included_data my_mean
  <date>     <fct> <dbl> <lgl>           <dbl>
1 2020-01-01 A         2 TRUE                4
2 2020-02-01 A         4 TRUE                4
3 2020-03-01 A         6 TRUE                4
4 2020-04-01 B         8 TRUE                8
5 2020-05-01 B        10 FALSE               8
6 2020-06-01 B        12 FALSE               8

I'm not sure what the correct mutate and filter would be so help is appreciated, as is an explanation for why the above does not work as intended.

Thanks!

Here, it may be better to use the index from 'included_data' to subset the 'x' column instead of doing another filter

library(dplyr)
my_df %>% 
     group_by(grp) %>%
     mutate(included_data = my_dates < max_date, 
             my_mean = mean(x[included_data])) %>%
     ungroup

-output

# A tibble: 6 x 5
#  my_dates   grp       x included_data my_mean
#  <date>     <chr> <dbl> <lgl>           <dbl>
#1 2020-01-01 A         2 TRUE                4
#2 2020-02-01 A         4 TRUE                4
#3 2020-03-01 A         6 TRUE                4
#4 2020-04-01 B         8 TRUE                8
#5 2020-05-01 B        10 FALSE               8
#6 2020-06-01 B        12 FALSE               8

Regarding why the OP's code didn't work, the . is the full dataset and it is doing the subset from that full data instead of the grouped data. We could use cur_data() instead of .

my_df %>%
    group_by(grp) %>%
    mutate(included_data = my_dates < max_date, 
    my_mean = mean(filter(cur_data(), my_dates < max_date)$x)) %>% 
    ungroup
# A tibble: 6 x 5
#  my_dates   grp       x included_data my_mean
#  <date>     <chr> <dbl> <lgl>           <dbl>
#1 2020-01-01 A         2 TRUE                4
#2 2020-02-01 A         4 TRUE                4
#3 2020-03-01 A         6 TRUE                4
#4 2020-04-01 B         8 TRUE                8
#5 2020-05-01 B        10 FALSE               8
#6 2020-06-01 B        12 FALSE               8

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM