How to use the grepl function multiple times in R
I have a vector like go_id and a data.frame like data:
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")
protein_id <- c("Q96IF1","P26371","Q8NHG8","P60372","O75526","Q01130")
bio_process <- c("[GO:0000086]; [GO:0000122]; [GO:0000932]", "[GO:0005829]; [GO:0008544]","[GO:0000209]; [GO:0005737]; [GO:0005765]","NA","[GO:0000398]; [GO:0003729]","[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]")
data <- as.data.frame(cbind(protein_id,bio_process))
How can I keep the rows of data for which the bio_process cell contains at least one of the elements of go_id? Note that a GO code cannot be repeated within the same bio_process cell.
To be more precise, I would like to get only the first, third, and sixth rows of the data.frame.
I have tried a for loop using the grepl function, like this:
go_id <- gsub("GO:","", go_id, fixed = TRUE)
for (i in 1:6) {
new_data <- data[grepl("\\[GO:go_id[i]\\]",data$Gene.ontology..biological.process.)]
}
I know this cannot work, because the variable's value is not substituted into the regular expression.
Any ideas on this? Thank you.
We can use Reduce with grepl:
data$ind <- Reduce(`|`, lapply(go_id, function(pat)
grepl(pat, data$bio_process, fixed = TRUE)))
data
# protein_id bio_process ind
#1 Q96IF1 [GO:0000086]; [GO:0000122]; [GO:0000932] TRUE
#2 P26371 [GO:0005829]; [GO:0008544] FALSE
#3 Q8NHG8 [GO:0000209]; [GO:0005737]; [GO:0005765] TRUE
#4 P60372 NA FALSE
#5 O75526 [GO:0000398]; [GO:0003729] FALSE
#6 Q01130 [GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714] TRUE
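To get the filtered rows rather than a flag column, the same logical vector can be used directly as a row index. This is a minimal sketch reusing the question's sample data; the names keep and new_data are only illustrative.

```r
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")
protein_id <- c("Q96IF1", "P26371", "Q8NHG8", "P60372", "O75526", "Q01130")
bio_process <- c("[GO:0000086]; [GO:0000122]; [GO:0000932]",
                 "[GO:0005829]; [GO:0008544]",
                 "[GO:0000209]; [GO:0005737]; [GO:0005765]",
                 "NA",
                 "[GO:0000398]; [GO:0003729]",
                 "[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]")
data <- data.frame(protein_id, bio_process, stringsAsFactors = FALSE)

# One logical vector per ID, combined element-wise with OR
keep <- Reduce(`|`, lapply(go_id, function(pat)
  grepl(pat, data$bio_process, fixed = TRUE)))

# Index the rows with the logical vector (illustrative name: new_data)
new_data <- data[keep, ]
new_data$protein_id
```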
You should use fixed = TRUE in grepl():
# Start with every row excluded, then OR in the matches for each ID
vect <- rep(FALSE, nrow(data))
for (id in go_id) {
  vect <- vect | grepl(id, data$bio_process, fixed = TRUE)
}
data[vect, ]
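Why fixed = TRUE matters here: without it, the square brackets make the ID a regular-expression character class, so the pattern matches any string containing one of the characters between the brackets. A small sketch on the question's bio_process values illustrates the difference; the names as_regex and as_literal are illustrative.

```r
bio_process <- c("[GO:0000086]; [GO:0000122]; [GO:0000932]",
                 "[GO:0005829]; [GO:0008544]",
                 "[GO:0000209]; [GO:0005737]; [GO:0005765]",
                 "NA",
                 "[GO:0000398]; [GO:0003729]",
                 "[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]")

# Treated as a regex: "[GO:0000086]" is a character class, so any cell
# containing G, O, ":", 0, 8 or 6 counts as a match
as_regex <- grepl("[GO:0000086]", bio_process)

# Treated as a literal string: only cells containing the exact ID match
as_literal <- grepl("[GO:0000086]", bio_process, fixed = TRUE)

as_regex
as_literal
```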
You can subset using str_extract to build the pattern from the distinctive substrings of the IDs:
library(stringr)
data[grepl(paste(str_extract(go_id, "\\d{4}]"), collapse="|"), data$bio_process),]
protein_id bio_process
1 Q96IF1 [GO:0000086]; [GO:0000122]; [GO:0000932]
3 Q8NHG8 [GO:0000209]; [GO:0005737]; [GO:0005765]
6 Q01130 [GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]
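One caveat with matching on the last four digits: a different ID that happens to end in the same digits would also match. The sketch below illustrates this with a made-up cell value (the ID [GO:0010086] is hypothetical). It uses base R's regmatches and regexpr as a stand-in for str_extract so that it runs without stringr.

```r
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")

# Base R stand-in for str_extract(go_id, "\\d{4}]")
suffix <- regmatches(go_id, regexpr("[0-9]{4}\\]", go_id))
pattern <- paste(suffix, collapse = "|")

# Hypothetical cell: none of its IDs are in go_id
other <- "[GO:0010086]; [GO:0005737]"

# The suffix pattern still matches, because "0086]" appears in the cell
suffix_match <- grepl(pattern, other)

# Matching the full IDs literally does not
full_match <- any(vapply(go_id, grepl, logical(1), x = other, fixed = TRUE))

suffix_match
full_match
```

If such near-duplicates can occur in the data, matching on the full ID is the safer choice.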
EDIT:
The most straightforward solution is to subset with grepl, using paste0 to add the escape backslashes for the metacharacter [:
data[grepl(paste0("\\", go_id, collapse="|"), data$bio_process),]
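To see what this builds, it can help to print the pattern before using it. The values of go_id are pasted into a single regular expression, which is how a variable's contents end up inside the pattern. A minimal sketch on the question's sample data; the names pattern and hits are illustrative.

```r
go_id <- c("[GO:0000086]", "[GO:0000209]", "[GO:0000278]")
protein_id <- c("Q96IF1", "P26371", "Q8NHG8", "P60372", "O75526", "Q01130")
bio_process <- c("[GO:0000086]; [GO:0000122]; [GO:0000932]",
                 "[GO:0005829]; [GO:0008544]",
                 "[GO:0000209]; [GO:0005737]; [GO:0005765]",
                 "NA",
                 "[GO:0000398]; [GO:0003729]",
                 "[GO:0000278]; [GO:0000381]; [GO:0000398]; [GO:0003714]")
data <- data.frame(protein_id, bio_process, stringsAsFactors = FALSE)

# Prefix each ID with a backslash and join the alternatives with "|"
pattern <- paste0("\\", go_id, collapse = "|")
cat(pattern, "\n")
# \[GO:0000086]|\[GO:0000209]|\[GO:0000278]

# A single grepl call now tests all IDs at once
hits <- grepl(pattern, data$bio_process)
data[hits, ]
```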