
R function to identify cases where condition is met x number of times in any of n number of columns?

I have a dataframe where I would like to identify cases (rows) where a given condition is met at least a certain number of times in a set of columns. In the toy example below, I would like to identify cases where "A" is the choice for two of three columns (Choice_1 to Choice_3). I do not care in which two of the three columns "A" is found. In my example, ID = 1 and ID = 4 would be identified.

This should work with any number of "A"s desired in any number of columns (e.g., if I wanted to identify cases where "A" is the choice in three of the four Choice columns, only ID = 1 would be identified).

ID <- 1:4
Choice_1 <- c("A", "B", "C", "D")
Choice_2 <- c("A", "D", "C", "A")
Choice_3 <- c("A", "C", "A", "A")
Choice_4 <- c("B", "B", "A", "B")

df <- data.frame(ID, Choice_1, Choice_2, Choice_3, Choice_4)

> df
ID Choice_1 Choice_2 Choice_3 Choice_4
 1        A        A        A        B
 2        B        D        C        B
 3        C        C        A        A
 4        D        A        A        B

One roundabout way to do this would be to convert "A"s to 1 and everything else to 0, sum the Choice columns I am interested in, and check that the sum is equal to or higher than my threshold, but I feel like there must be a better way.

The way I imagine it, it would be some form of if_else statement included in a mutate so rows that match the condition would be identified with 1 and those that don't with 0:

df %>% mutate(cond_matched = if_else( two of (Choice_1, Choice_2, Choice_3) == "A", 1, 0))

ID Choice_1 Choice_2 Choice_3 Choice_4 cond_matched
 1        A        A        A        B            1
 2        B        D        C        B            0
 3        C        C        A        A            0
 4        D        A        A        B            1

I'm hoping I've just been searching with the wrong keywords. Thank you for any help!

A base R option would be to create a logical matrix from the selected columns ( df[2:4] == "A" ), take the row-wise sum of the TRUE elements, check whether it is greater than or equal to 2, and coerce the resulting logical vector to binary with as.integer or the unary + (hacky):

df$cond_matched <- +(rowSums(df[2:4] == "A") >= 2)
df$cond_matched
#[1] 1 0 0 1
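
The question also asks for this to work with any threshold and any set of columns. As a minimal sketch, the same rowSums logic could be wrapped in a small function. The name flag_matches and its arguments are hypothetical (they are not part of base R or of the answer above), and treating NA cells as non-matches is an assumption.

```r
# Rebuild the example data so the snippet runs on its own
df <- data.frame(
  ID = 1:4,
  Choice_1 = c("A", "B", "C", "D"),
  Choice_2 = c("A", "D", "C", "A"),
  Choice_3 = c("A", "C", "A", "A"),
  Choice_4 = c("B", "B", "A", "B")
)

# Hypothetical helper: 1 where `value` appears in at least `n` of `cols`, else 0
flag_matches <- function(data, cols, value, n) {
  # Logical matrix of matches, summed per row; NA cells count as non-matches
  hits <- rowSums(data[cols] == value, na.rm = TRUE)
  as.integer(hits >= n)
}

# "A" in at least two of Choice_1 to Choice_3
two_of_three <- flag_matches(df, c("Choice_1", "Choice_2", "Choice_3"), "A", 2)

# "A" in at least three of the four Choice columns
three_of_four <- flag_matches(df, paste0("Choice_", 1:4), "A", 3)
```

With the toy data, two_of_three flags ID 1 and ID 4, and three_of_four flags only ID 1, which matches the cases described in the question.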

Or with tidyverse (the same logic as the base R solution, but with different syntax):

library(tidyverse)
df %>% 
    mutate(cond_matched = select(., 2:4) %>%
                            map(~ .x == 'A') %>%
                            reduce(`+`) %>%
                            `>=`(2) %>% 
                            as.integer)
#   ID Choice_1 Choice_2 Choice_3 Choice_4 cond_matched
#1  1        A        A        A        B            1
#2  2        B        D        C        B            0
#3  3        C        C        A        A            0
#4  4        D        A        A        B            1
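
For readers on a recent dplyr, a sketch of the same idea written with across() may be easier to read than the map/reduce chain. It assumes dplyr 1.0.0 or later (where across() is available) and selects the columns by name instead of by position; it is an alternative spelling, not the answer's own code.

```r
library(dplyr)

# Rebuild the example data so the snippet runs on its own
df <- data.frame(
  ID = 1:4,
  Choice_1 = c("A", "B", "C", "D"),
  Choice_2 = c("A", "D", "C", "A"),
  Choice_3 = c("A", "C", "A", "A"),
  Choice_4 = c("B", "B", "A", "B")
)

result <- df %>%
  mutate(
    cond_matched = as.integer(
      # across() returns one logical column per selected column;
      # rowSums() then counts the TRUE values in each row
      rowSums(across(Choice_1:Choice_3, ~ .x == "A")) >= 2
    )
  )
```

Changing the column range or the threshold (2) adapts this to other "x of n" conditions.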

One dplyr and tidyr possibility could be:

df %>%
 gather(var, val, -c(ID, Choice_4)) %>%
 group_by(ID) %>%
 summarise(cond_matched = as.integer(sum(val == "A") >= 2)) %>%
 ungroup() %>%
 left_join(df, by = c("ID" = "ID"))

     ID cond_matched Choice_1 Choice_2 Choice_3 Choice_4
  <int>        <int> <chr>    <chr>    <chr>    <chr>   
1     1            1 A        A        A        B       
2     2            0 B        D        C        B       
3     3            0 C        C        A        A       
4     4            1 D        A        A        B  
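
gather() is superseded in current tidyr in favour of pivot_longer(). A sketch of the same reshape-and-count approach with pivot_longer(), assuming tidyr 1.0.0 or later, might look like this:

```r
library(dplyr)
library(tidyr)

# Rebuild the example data so the snippet runs on its own
df <- data.frame(
  ID = 1:4,
  Choice_1 = c("A", "B", "C", "D"),
  Choice_2 = c("A", "D", "C", "A"),
  Choice_3 = c("A", "C", "A", "A"),
  Choice_4 = c("B", "B", "A", "B")
)

result <- df %>%
  # One row per ID and choice column; Choice_4 is left out of the reshape
  pivot_longer(Choice_1:Choice_3, names_to = "var", values_to = "val") %>%
  group_by(ID) %>%
  # Count the "A"s per ID and compare against the threshold
  summarise(cond_matched = as.integer(sum(val == "A") >= 2)) %>%
  # Bring back the original wide columns
  left_join(df, by = "ID")
```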

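Or with just dplyr (using basically the same logic as @akrun). Here .[-ncol(.)] drops the last column (Choice_4); the ID column stays in the comparison but never equals "A", so it does not affect the count: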

df %>%
 mutate(cond_matched = as.integer(rowSums(.[-ncol(.)] == "A") >= 2))

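To name the columns explicitly (note that grepl() matches the pattern anywhere in a column name, so "Choice_1" would also match a column such as "Choice_10"):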

df %>%
 mutate(cond_matched = as.integer(rowSums(.[grepl("Choice_1|Choice_2|Choice_3", colnames(.))] == "A") >= 2))
