Create mean column for specific columns depending on group in R

Question

I want to create a new column containg means of specific columns. The selected columns should depend on the group.

group	first	second	third
0	3	2	4
0	0	NA	5
0	2	7	1
1	3	1	6
1	4	0	NA
1	2	3	3
0	5	5	0
0	6	2	2
1	NA	1	3

As an example: I want a mean column with the following conditions:

if a row contains a "0" in group, the mean should be calculated from "first" and "second"
if a row contains a "1" in group, the mean should be calculated from "first" and "third"
if a cell contains NA it should be ignored

So the final dataframe should look something like this:

group	first	second	third	mean
0	3	2	4	2.5
0	0	NA	5	0
0	2	7	1	4.5
1	3	1	6	4.5
1	4	0	NA	4
1	2	3	3	2.5
0	5	5	0	5
0	6	2	2	4
1	NA	1	3	3

Since my dataframe contains over 50 variables (and a few thousand rows) and not just those I want the mean from I can't select specific columns by their column or row number (like c(2,5),). I was thinking about adding a condition that explains to R that it should calculate the mean from "first" and "second" only for those rows that have a "0" in group and then the same principle for group = 1. I have no idea how to combinde these conditions or how I can do this in several steps.

Answer 1

library(tidyverse)

tribble(
  ~group, ~first, ~second, ~third,
  0, 3, 2, 4,
  0, 0, NA, 5,
  0, 2, 7, 1,
  1, 3, 1, 6,
  1, 4, 0, NA,
  1, 2, 3, 3,
  0, 5, 5, 0,
  0, 6, 2, 2,
  1, NA, 1, 3
) |> 
  rowwise() |> 
  mutate(mean = if_else(group == 0, mean(c_across(c(first, second)), na.rm = TRUE), 
                        mean(c_across(c(first, third)), na.rm = TRUE)))

#> # A tibble: 9 × 5
#> # Rowwise: 
#>   group first second third  mean
#>   <dbl> <dbl>  <dbl> <dbl> <dbl>
#> 1     0     3      2     4   2.5
#> 2     0     0     NA     5   0  
#> 3     0     2      7     1   4.5
#> 4     1     3      1     6   4.5
#> 5     1     4      0    NA   4  
#> 6     1     2      3     3   2.5
#> 7     0     5      5     0   5  
#> 8     0     6      2     2   4  
#> 9     1    NA      1     3   3

^{Created on 2022-06-08 by the reprex package (v2.0.1)}

Answer 2

One way to do this would be to pivot the data to long format and use case_when() to add a weight variable of 0 (for values you want ignored) and 1 (for values you want included) according to your conditions. Use weighted.mean() to calculate your mean and pivot back to wide.

library(tidyr)
library(dplyr)

df %>%
  rowid_to_column() %>%
  pivot_longer(-c(rowid, group)) %>%
  mutate(weight = case_when(group == 0 & name == "third" ~ 0,
                            group == 1 & name == "second" ~ 0,
                            TRUE ~ 1)) %>%
  group_by(rowid) %>%
  mutate(mean = weighted.mean(value, weight, na.rm = TRUE)) %>%
  pivot_wider(-weight) %>%
  ungroup() %>%
  relocate(mean, .after = last_col())

# A tibble: 9 × 6
  rowid group first second third  mean
  <int> <dbl> <dbl>  <dbl> <dbl> <dbl>
1     1     0     3      2     4   2.5
2     2     0     0     NA     5   0  
3     3     0     2      7     1   4.5
4     4     1     3      1     6   4.5
5     5     1     4      0    NA   4  
6     6     1     2      3     3   2.5
7     7     0     5      5     0   5  
8     8     0     6      2     2   4  
9     9     1    NA      1     3   3

Answer 3

If you have many groups and many columns, then I would recommend a more programmatic approach. You can define a code list code_ls where you define which columns should be used for which group numbers. Then we can subset this with dplyr::cur_group()$group and use it in an across statement to select those columns and wrap that into rowMeans() . Note that we use all_of() inside across() to select columns based on a character vector. Since your groups are numeric and we want to subset code_ls by name we wrap cur_group()$group into as.character .

library(dplyr)

code_ls <- list(`0` = c("first", "second"),
                `1` = c("first", "third"))

dat %>% 
  group_by(group) %>% 
  mutate(mean = rowMeans(across(
    all_of(code_ls[[as.character(cur_group()$group)]])
  ), na.rm = TRUE))

#> # A tibble: 9 x 5
#> # Groups:   group [2]
#>   group first second third  mean
#>   <dbl> <dbl>  <dbl> <dbl> <dbl>
#> 1     0     3      2     4   2.5
#> 2     0     0     NA     5   0  
#> 3     0     2      7     1   4.5
#> 4     1     3      1     6   4.5
#> 5     1     4      0    NA   4  
#> 6     1     2      3     3   2.5
#> 7     0     5      5     0   5  
#> 8     0     6      2     2   4  
#> 9     1    NA      1     3   3

# the data
dat <- tribble(
  
  ~group, ~first, ~second, ~third,
  
   0,   3,  2,  4, 
   0,   0,  NA, 5, 
   0,   2,  7,  1, 
   1,   3,  1,  6, 
   1,   4,  0,  NA,  
   1,   2,  3,  3, 
   0,   5,  5,  0, 
   0,   6,  2,  2, 
   1,   NA,  1, 3
)

^{Created on 2022-06-08 by the reprex package (v2.0.1)}

Create mean column for specific columns depending on group in R

Question

3 answers

solution1
3 ACCPTED 2022-06-08 09:20:06

solution2
2 2022-06-08 09:17:11

solution3
2 2022-06-08 09:31:23

Create mean column for specific columns depending on group in R

Question

3 answers

solution1 3 ACCPTED 2022-06-08 09:20:06

solution2 2 2022-06-08 09:17:11

solution3 2 2022-06-08 09:31:23

solution1
3 ACCPTED 2022-06-08 09:20:06

solution2
2 2022-06-08 09:17:11

solution3
2 2022-06-08 09:31:23