Partial string matching on two columns in R

Question

I have a large data frame (the example here has only 2 columns):

CancerVar<-c("CancerVar:9#Tier_II_potential","CancerVar:2#Tier_IV_benign","CancerVar:11#Tier_I_strong","CancerVar:2#Tier_IV_benign","CancerVar:2#Tier_IV_benign")
driver_mut_prediction<-c("not protein-affecting","TIER 1","passenger","TIER 2","passenger")
df<-data.frame(CancerVar,driver_mut_prediction)


  df
                      CancerVar driver_mut_prediction
1 CancerVar:9#Tier_II_potential not protein-affecting
2    CancerVar:2#Tier_IV_benign                TIER 1
3    CancerVar:11#Tier_I_strong             passenger
4    CancerVar:2#Tier_IV_benign                TIER 2
5    CancerVar:2#Tier_IV_benign             passenger

I want to select rows using partial (and different) string matches on two columns. Specifically, I want the rows where EITHER (CancerVar contains Tier I or Tier II) OR (driver_mut_prediction contains TIER 1 or TIER 2).

I tried:

df_sub<-df[with(df, grepl("TIER|Tier_I|Tier_II", paste(driver_mut_prediction, CancerVar,ignore.case=FALSE))),]

The result still contains the last row (so neither condition seems to work).

I also tried:

df %>% select(contains("Tier_I|Tier_II|TIER 1|TIER 2"))

This returns a data frame with 0 columns and 5000 rows.

Please help!

Answer 1

This approach should work:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
CancerVar<-c("CancerVar:9#Tier_II_potential","CancerVar:2#Tier_IV_benign","CancerVar:11#Tier_I_strong","CancerVar:2#Tier_IV_benign","CancerVar:2#Tier_IV_benign")
driver_mut_prediction<-c("not protein-affecting","TIER 1","passenger","TIER 2","passenger")
df<-data.frame(CancerVar,driver_mut_prediction)

df %>%
  filter(
    grepl("Tier_I_|Tier_II_", CancerVar) |
    grepl("TIER 1|TIER 2", driver_mut_prediction)
   )
#>                       CancerVar driver_mut_prediction
#> 1 CancerVar:9#Tier_II_potential not protein-affecting
#> 2    CancerVar:2#Tier_IV_benign                TIER 1
#> 3    CancerVar:11#Tier_I_strong             passenger
#> 4    CancerVar:2#Tier_IV_benign                TIER 2

Created on 2022-04-06 by the reprex package (v2.0.1)

Alternatively, using base R:

CancerVar<-c("CancerVar:9#Tier_II_potential","CancerVar:2#Tier_IV_benign","CancerVar:11#Tier_I_strong","CancerVar:2#Tier_IV_benign","CancerVar:2#Tier_IV_benign")
driver_mut_prediction<-c("not protein-affecting","TIER 1","passenger","TIER 2","passenger")
df<-data.frame(CancerVar,driver_mut_prediction)

df[grepl("Tier_I_|Tier_II_", df$CancerVar) | grepl("TIER 1|TIER 2", df$driver_mut_prediction),]
#>                       CancerVar driver_mut_prediction
#> 1 CancerVar:9#Tier_II_potential not protein-affecting
#> 2    CancerVar:2#Tier_IV_benign                TIER 1
#> 3    CancerVar:11#Tier_I_strong             passenger
#> 4    CancerVar:2#Tier_IV_benign                TIER 2

Created on 2022-04-06 by the reprex package (v2.0.1)
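For context on why the attempts in the question fail, here is a small base R sketch. It rests on three observations: "Tier_I" is a prefix of "Tier_IV", so the pattern "TIER|Tier_I|Tier_II" matches every row; `ignore.case=FALSE` was passed to `paste()` rather than `grepl()`, so it is simply pasted on as the string "FALSE"; and dplyr's `select()` picks columns (with `contains()` matching a literal string against column names), whereas `filter()` is the verb that picks rows. The variable names `cancer_var`, `loose`, `strict` and `pasted` are made up for this illustration.

```r
# Three values copied from the question's CancerVar column
cancer_var <- c("CancerVar:9#Tier_II_potential",
                "CancerVar:2#Tier_IV_benign",
                "CancerVar:11#Tier_I_strong")

# "Tier_I" is a prefix of "Tier_IV", so this pattern matches every value
loose <- grepl("Tier_I|Tier_II", cancer_var)

# Requiring the trailing underscore excludes "Tier_IV_"
strict <- grepl("Tier_I_|Tier_II_", cancer_var)

# paste() has no ignore.case argument, so the value is pasted as text
pasted <- paste("passenger", "CancerVar:2#Tier_IV_benign", ignore.case = FALSE)

print(loose)
print(strict)
print(pasted)
```

The trailing underscore in "Tier_I_|Tier_II_" is what makes the answer's pattern behave differently from the one in the question.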

Answer 2

You can use str_detect (from stringr, which is loaded with the tidyverse):

library(tidyverse)

df %>% 
  filter(str_detect(CancerVar, "Tier_I_|Tier_II_") | 
           str_detect(driver_mut_prediction, "TIER 1|TIER 2"))

Output

                      CancerVar driver_mut_prediction
1 CancerVar:9#Tier_II_potential not protein-affecting
2    CancerVar:2#Tier_IV_benign                TIER 1
3    CancerVar:11#Tier_I_strong             passenger
4    CancerVar:2#Tier_IV_benign                TIER 2

Data

df <- structure(list(CancerVar = c("CancerVar:9#Tier_II_potential", 
"CancerVar:2#Tier_IV_benign", "CancerVar:11#Tier_I_strong", "CancerVar:2#Tier_IV_benign", 
"CancerVar:2#Tier_IV_benign"), driver_mut_prediction = c("not protein-affecting", 
"TIER 1", "passenger", "TIER 2", "passenger")), class = "data.frame", row.names = c(NA, 
-5L))
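If more columns need their own pattern later, the same OR logic could be written once over a named vector of patterns. This is only a sketch in base R, not something either answer proposes: the names `patterns`, `hits` and `keep` are hypothetical, and the patterns "Tier_I{1,2}_" and "TIER [12]" are compact rewrites of the ones used above.

```r
df <- data.frame(
  CancerVar = c("CancerVar:9#Tier_II_potential", "CancerVar:2#Tier_IV_benign",
                "CancerVar:11#Tier_I_strong", "CancerVar:2#Tier_IV_benign",
                "CancerVar:2#Tier_IV_benign"),
  driver_mut_prediction = c("not protein-affecting", "TIER 1", "passenger",
                            "TIER 2", "passenger")
)

# One regular expression per column, keyed by column name (hypothetical helper vector)
patterns <- c(CancerVar = "Tier_I{1,2}_", driver_mut_prediction = "TIER [12]")

# One logical vector per column: does each row match that column's pattern?
hits <- Map(function(col, pat) grepl(pat, df[[col]]), names(patterns), patterns)

# Combine the per-column results with OR, then keep the matching rows
keep <- Reduce(`|`, hits)
print(df[keep, ])
```

Swapping `|` for `&` in the `Reduce()` call would require every column to match instead of any one of them.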

Answer 1
1 Accepted 2022-04-06 03:23:49

Answer 2
1 2022-04-06 03:26:48
