
How can I subset rows of a data frame in R if any value in a row matches a value in a vector?

I have a dataset named DF1, which looks like this:

V1    V2    V3    V4    V5    V6
A01N  A01N  A01P  Null  Null  Null
C09K  A61K  C09D  C08K  Null  Null
A61K  A61P  A61P  A61K  A61K  A61K
A01D  A01D  A01D  A01D  A01D  Null
E06A  Null  Null  Null  Null  Null

I also have a vector named V:

(A01N C09K A01D)

I want to subset DF1 based on the elements of V: if a row of DF1 contains any element of V, in any column, keep the row; otherwise, drop it. The result should be:

V1    V2    V3    V4    V5    V6
A01N  A01N  A01P  Null  Null  Null
C09K  A61K  C09D  C08K  Null  Null
A01D  A01D  A01D  A01D  A01D  Null

I tried to use subset():

test_t1 <- subset(DF1, DF1[,1:6] %in% V)

but I only know how to subset on a single column or row. How do I handle multiple columns?

Try reshaping with tidyverse functions. Pivot the columns to long format, compare the values with the vector, filter the matching rows, and then reshape back to wide. Here is the code:

library(tidyverse)
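# Values to match (df is defined under "Some data used" below)
vec <- c('A01N','C09K','A01D')
# Reshape to long, flag matches per original row id, keep ids with a match, reshape back to wide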
new <- df %>% mutate(id=row_number()) %>%
  pivot_longer(-id) %>%
  mutate(Flag=+(value%in%vec)) %>%
  group_by(id) %>%
  mutate(Sum=sum(Flag)) %>%
  filter(Sum>=1) %>%
  select(-c(Flag,Sum)) %>%
  pivot_wider(names_from = name,values_from=value) %>%
  ungroup %>% select(-id)

Output:

# A tibble: 3 x 6
  V1    V2    V3    V4    V5    V6   
  <chr> <chr> <chr> <chr> <chr> <chr>
1 A01N  A01N  A01P  Null  Null  Null 
2 C09K  A61K  C09D  C08K  Null  Null 
3 A01D  A01D  A01D  A01D  A01D  Null 
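
If the installed dplyr provides if_any() (assumed here: dplyr 1.0.4 or later), the same row filter can probably be written without reshaping at all. The following is only a sketch of that idea, not part of the original answer; the data frame is rebuilt inline so the snippet runs on its own, and new_if_any is an illustrative name.

```r
library(dplyr)

# Same example data as in this answer
df <- data.frame(
  V1 = c("A01N", "C09K", "A61K", "A01D", "E06A"),
  V2 = c("A01N", "A61K", "A61P", "A01D", "Null"),
  V3 = c("A01P", "C09D", "A61P", "A01D", "Null"),
  V4 = c("Null", "C08K", "A61K", "A01D", "Null"),
  V5 = c("Null", "Null", "A61K", "A01D", "Null"),
  V6 = c("Null", "Null", "A61K", "Null", "Null"),
  stringsAsFactors = FALSE
)
vec <- c("A01N", "C09K", "A01D")

# Keep a row if at least one of its columns holds a value found in vec
new_if_any <- df %>% filter(if_any(everything(), ~ .x %in% vec))
```

Because no id column or pivoting is involved, column types and column order are left as they were in df.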

Or using base R with apply():

#Code2
new <- df[apply(df,1,function(x) ifelse(sum(x %in% vec)>=1,1,0))==1,]

Output:

    V1   V2   V3   V4   V5   V6
1 A01N A01N A01P Null Null Null
2 C09K A61K C09D C08K Null Null
4 A01D A01D A01D A01D A01D Null

Some data used:

#Data
df <- structure(list(V1 = c("A01N", "C09K", "A61K", "A01D", "E06A"), 
    V2 = c("A01N", "A61K", "A61P", "A01D", "Null"), V3 = c("A01P", 
    "C09D", "A61P", "A01D", "Null"), V4 = c("Null", "C08K", "A61K", 
    "A01D", "Null"), V5 = c("Null", "Null", "A61K", "A01D", "Null"
    ), V6 = c("Null", "Null", "A61K", "Null", "Null")), class = "data.frame", row.names = c(NA, 
-5L))

If the extra intermediate variables are causing issues, here is a simplified version of the code (many thanks to GregorThomas):

#Code1
new <- df %>% mutate(id=row_number()) %>%
  pivot_longer(-id) %>%
  group_by(id) %>%
  filter(sum(value %in% vec) > 0) %>%
  pivot_wider(names_from = name,values_from=value) %>%
  ungroup %>% select(-id)

#Code2
new <- df[apply(df,1,function(x) sum(x %in% vec)>=1),]

This is a simple one-liner in base R:

DF1[rowSums(sapply(DF1, `%in%`, vec)) > 0, ]
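
A point worth spelling out: %in% applied to a whole data frame treats it as a list of columns, so it returns one value per column rather than one per cell, and rowSums() needs a matrix. One way around this, sketched below with the illustrative names hits and keep, is to test each column separately and let sapply() bind the results into a logical matrix.

```r
DF1 <- data.frame(
  V1 = c("A01N", "C09K", "A61K", "A01D", "E06A"),
  V2 = c("A01N", "A61K", "A61P", "A01D", "Null"),
  V3 = c("A01P", "C09D", "A61P", "A01D", "Null"),
  V4 = c("Null", "C08K", "A61K", "A01D", "Null"),
  V5 = c("Null", "Null", "A61K", "A01D", "Null"),
  V6 = c("Null", "Null", "A61K", "Null", "Null"),
  stringsAsFactors = FALSE
)
vec <- c("A01N", "C09K", "A01D")

# One logical column per data-frame column: TRUE where the cell is in vec
hits <- sapply(DF1, function(col) col %in% vec)

# A row is kept when it has at least one TRUE
keep <- rowSums(hits) > 0

result <- DF1[keep, ]
```

Inspecting hits before subsetting is a convenient way to check which cells actually matched.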

Another option in base R is

subset(DF1, Reduce(`+`, lapply(DF1, `%in%`, vec)) > 0)

Output:

#     V1   V2   V3   V4   V5   V6
#1 A01N A01N A01P Null Null Null
#2 C09K A61K C09D C08K Null Null
#4 A01D A01D A01D A01D A01D Null
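
The sum-then-compare step above can also be expressed as a logical OR across columns. This is a variant sketch rather than the answer's own code; keep is an illustrative name, and the data are repeated so the snippet is self-contained.

```r
DF1 <- data.frame(
  V1 = c("A01N", "C09K", "A61K", "A01D", "E06A"),
  V2 = c("A01N", "A61K", "A61P", "A01D", "Null"),
  V3 = c("A01P", "C09D", "A61P", "A01D", "Null"),
  V4 = c("Null", "C08K", "A61K", "A01D", "Null"),
  V5 = c("Null", "Null", "A61K", "A01D", "Null"),
  V6 = c("Null", "Null", "A61K", "Null", "Null"),
  stringsAsFactors = FALSE
)
vec <- c("A01N", "C09K", "A01D")

# lapply() gives one logical vector per column; Reduce() ORs them together
keep <- Reduce(`|`, lapply(DF1, `%in%`, vec))

result <- DF1[keep, ]
```

The result is the same set of rows; the OR form just states the "any column matches" condition directly instead of counting matches.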

Data:

DF1 <- structure(list(V1 = c("A01N", "C09K", "A61K", "A01D", "E06A"), 
    V2 = c("A01N", "A61K", "A61P", "A01D", "Null"), V3 = c("A01P", 
    "C09D", "A61P", "A01D", "Null"), V4 = c("Null", "C08K", "A61K", 
    "A01D", "Null"), V5 = c("Null", "Null", "A61K", "A01D", "Null"
    ), V6 = c("Null", "Null", "A61K", "Null", "Null")), 
    class = "data.frame", row.names = c(NA, 
-5L))

vec <-  c('A01N','C09K','A01D')
