简体   繁体   中英

R dplyr conditional mutate dependent on multiple columns

I am trying to check the values of various binary columns and then mutate a new column populated with a value conditional to those columns.

The new column should indicate which column included 1 when all others were 0 and record the number of that column. If multiple 1s are present, a zero should be recorded. Y1 is included as an indicator that some columns need to be ignored.

For example,
if X1=0, X2=0, X3=1; I'd like Xfoo to be 3,
if X1=1, X2=0, X3=1; I'd like Xfoo to be 0, etc

So far I have been achieving this using dplyr mutate and case_when as follows:

# create toy data:
set.seed(123)
N<-10
mydata <- data.frame(Y1 = sample(0 : 1, N, replace = TRUE),
                     X1 = sample(0 : 1, N, replace = TRUE),
                     X2 = sample(0 : 1, N, replace = TRUE),
                     X3 = sample(0 : 1, N, replace = TRUE))

# perform mutate:
mydata <- mydata %>%
  mutate(
    Xfoo = case_when(
      X1 == 1 & X2 == 0 & X3 == 0 ~ 1,
      X1 == 0 & X2 == 1 & X3 == 0 ~ 2,
      X1 == 0 & X2 == 0 & X3 == 1 ~ 3,
      TRUE ~ 0)
    )

I am happy with this approach given small numbers of variables, however, I realise that this quickly becomes less effective with large numbers of variables.

To combat this, I'd have thought that grouping conditions in some way would keep things concise. Something like:

case_when(
      X1 == 1 & c(X2, X3) ==0 ~ 1
...

I realise that my logic is wrong here, I've been experimenting with all() and any() but can't figure it out and would appreciate some pointers.

There's not really a built in function function for multiple comparisons but you could create one

all_same_as <- function(val, ...) {
   Reduce(function(a,b) a & (b==val), list(...), init = TRUE)
}

So this function allows you to pass in any number of columns, and it checks that all the values in those columns are equal to the first value you pass in. So the first parameter should be the number you want the rest to match And then you could use

mydata %>%
  mutate(
    Xfoo = case_when(
      X1 == 1 & all_same_as(0, X2, X3) ~ 1,
      X2 == 1 & all_same_as(0, X1, X3) ~ 2,
      X3 == 1 & all_same_as(0, X1, X2) ~ 3,
      TRUE ~ 0)
  )

You could check what column has the maximum value and extract the column names and the number from that (as indicated by @Henrik in the comments):

library(tidyverse)

mydata |>
  mutate(Xfoo = pmap_dbl(across(starts_with("X")), 
                                ~ if_else(sum(c(...)) == 1,
                                          names(c(...))[which.max(c(...))] |> parse_number(),
                                          0)))

Output:

# A tibble: 10 × 5
      Y1    X1    X2    X3  Xfoo
   <int> <int> <int> <int> <dbl>
 1     0     1     0     0     1
 2     0     1     1     1     0
 3     0     1     0     0     1
 4     1     0     0     1     3
 5     0     1     0     1     0
 6     1     0     0     0     0
 7     1     1     1     0     0
 8     1     0     1     0     2
 9     0     0     0     0     0
10     0     0     1     1     0

See also: For each row return the column name of the largest value

You could try this, but it will be less computationally efficient. I'd use {data.table} if you are working with a large dataset.

mydata %>%
  rowwise() %>%
  mutate(
    Xfoo = case_when(
      all(c(X2,X3) == 0) & X1 == 1 ~ 1,
      all(c(X1,X3) == 0) & X2 == 1 ~ 2,
      all(c(X1,X2) == 0) & X3 == 1 ~ 3,
      TRUE ~ 0)
  ) %>%
  ungroup()

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM