How to use the grepl function multiple times in R

I have a vector, go_id, and a data.frame, data:

go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")


protein_id <- c("Q96IF1","P26371","Q8NHG8","P60372","O75526","Q01130")
bio_process <- c("[GO:0000086]; [GO:0000122]; [GO:0000932]", "[GO:0005829]; [GO:0008544]","[GO:0000209]; [GO:0005737]; [GO:0005765]","NA","[GO:0000398]; [GO:0003729]","[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]")
data <- as.data.frame(cbind(protein_id,bio_process))

How can I keep the rows of data whose bio_process cell contains at least one of the elements of go_id? Note that a GO code cannot be repeated within the same bio_process cell.

To be more precise, I would like to get only the first, third and sixth rows of the data.frame.

I have tried a for loop using the grepl function, like this:

go_id <- gsub("GO:","", go_id, fixed = TRUE)
for (i in 1:6) {
  new_data <- data[grepl("\\[GO:go_id[i]\\]",data$Gene.ontology..biological.process.)]
  }

I know this cannot work, because I cannot put a variable's value into a regular expression this way.

Any ideas on this? Thank you.

We can use Reduce with grepl:

data$ind <-  Reduce(`|`, lapply(go_id, function(pat) 
           grepl(pat, data$bio_process, fixed = TRUE)))

data
#  protein_id                                            bio_process   ind
#1     Q96IF1               [GO:0000086]; [GO:0000122]; [GO:0000932]  TRUE
#2     P26371                             [GO:0005829]; [GO:0008544] FALSE
#3     Q8NHG8               [GO:0000209]; [GO:0005737]; [GO:0005765]  TRUE
#4     P60372                                                     NA FALSE
#5     O75526                             [GO:0000398]; [GO:0003729] FALSE
#6     Q01130 [GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]  TRUE
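
To get the rows themselves rather than an indicator column, the same logical vector can be used directly as a row index. A minimal sketch, assuming the go_id and data objects from the question (rebuilt here with data.frame() so the block runs on its own; the names keep and result are illustrative):

```r
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")
protein_id <- c("Q96IF1", "P26371", "Q8NHG8", "P60372", "O75526", "Q01130")
bio_process <- c("[GO:0000086]; [GO:0000122]; [GO:0000932]",
                 "[GO:0005829]; [GO:0008544]",
                 "[GO:0000209]; [GO:0005737]; [GO:0005765]",
                 "NA",
                 "[GO:0000398]; [GO:0003729]",
                 "[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]")
data <- data.frame(protein_id, bio_process, stringsAsFactors = FALSE)

# One logical vector per ID, combined element-wise with OR
keep <- Reduce(`|`, lapply(go_id, function(pat)
  grepl(pat, data$bio_process, fixed = TRUE)))

# Use the logical vector as a row index instead of storing it as a column
result <- data[keep, ]
result$protein_id
# "Q96IF1" "Q8NHG8" "Q01130"
```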

You should use fixed = TRUE in grepl():

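# Start with no rows selected
vect <- rep(FALSE, nrow(data))
for (id in go_id) {
  # Flag rows whose bio_process contains this ID as a literal string
  vect <- vect | grepl(id, data$bio_process, fixed = TRUE)
}
data[vect, ]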
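
The reason fixed = TRUE matters here is that the IDs start with [ and end with ], which a regular expression reads as a bracket expression (a set of single characters) rather than as literal text. A small sketch of the difference, using the first ID and the bio_process values from the question (the names as_regex and as_fixed are illustrative):

```r
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")
bio_process <- c("[GO:0000086]; [GO:0000122]; [GO:0000932]",
                 "[GO:0005829]; [GO:0008544]",
                 "[GO:0000209]; [GO:0005737]; [GO:0005765]",
                 "NA",
                 "[GO:0000398]; [GO:0003729]",
                 "[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]")

# As a regex, "[GO:0000086]" matches any single character among
# G, O, :, 0, 8 and 6, so every cell containing "G" matches
as_regex <- grepl(go_id[1], bio_process)

# As a fixed string, only the literal text "[GO:0000086]" matches
as_fixed <- grepl(go_id[1], bio_process, fixed = TRUE)

as_regex
# TRUE TRUE TRUE FALSE TRUE TRUE
as_fixed
# TRUE FALSE FALSE FALSE FALSE FALSE
```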

You can subset using str_extract to build the pattern from the distinctive part of each ID (here, the last four digits and the closing bracket):

library(stringr)
data[grepl(paste(str_extract(go_id, "\\d{4}]"), collapse="|"),  data$bio_process),]
  protein_id                                            bio_process
1     Q96IF1               [GO:0000086]; [GO:0000122]; [GO:0000932]
3     Q8NHG8               [GO:0000209]; [GO:0005737]; [GO:0005765]
6     Q01130 [GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]
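
One caveat with matching only the last four digits: two different GO codes can share them. The sketch below illustrates this with a made-up cell, "[GO:0010086]", which is not in the question's data; it uses base R's sub() in place of str_extract so that it runs without stringr, and the names short, pattern and hypothetical_cell are assumptions for illustration only.

```r
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")

# Base R stand-in for str_extract(go_id, "\\d{4}]"):
# keep the last four digits plus the closing bracket
short <- sub(".*(\\d{4}\\])$", "\\1", go_id)
pattern <- paste(short, collapse = "|")

# Hypothetical cell holding a different GO code that ends in the same digits
hypothetical_cell <- "[GO:0010086]"

# The shortened pattern matches it, although the full ID is not in go_id
short_match <- grepl(pattern, hypothetical_cell)

# Matching the full IDs as fixed strings does not
full_match <- any(vapply(go_id, function(id)
  grepl(id, hypothetical_cell, fixed = TRUE), logical(1)))

short_match
# TRUE
full_match
# FALSE
```

For the data in the question no such collision occurs, so the result above is the expected one.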

EDIT:

The most straightforward solution is to subset with grepl, using paste0 to add the escaping backslash in front of the metacharacter [:

data[grepl(paste0("\\", go_id, collapse="|"),  data$bio_process),]
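
To see what this builds, it can help to look at the pattern string itself. A short sketch, assuming the original go_id vector (before the gsub call in the question) and the bio_process values from the question; the names pattern and matched_rows are illustrative:

```r
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")
bio_process <- c("[GO:0000086]; [GO:0000122]; [GO:0000932]",
                 "[GO:0005829]; [GO:0008544]",
                 "[GO:0000209]; [GO:0005737]; [GO:0005765]",
                 "NA",
                 "[GO:0000398]; [GO:0003729]",
                 "[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]")

# "\\" in R source is a single backslash, so each ID becomes \[GO:...]
# and the IDs are joined into one alternation with |
pattern <- paste0("\\", go_id, collapse = "|")
cat(pattern, "\n")
# \[GO:0000086]|\[GO:0000209]|\[GO:0000278]

# Row numbers whose cell contains at least one of the IDs
matched_rows <- which(grepl(pattern, bio_process))
matched_rows
# 1 3 6
```

Because the alternation is a single regular expression, grepl only has to be called once, with no loop or Reduce.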
