Remove everything in a string except a needed word containing a certain pattern in R
I have a vector of strings, and I would like to remove everything in each string except the word that contains a certain pattern (here, mir).
s <- c("a mir-96 line (kk27)", "mir-133a cell",
"d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")
I want to obtain: mir-96, mir-133a, mir-14-3p, mir133, mir_23_5p
I know the general idea: use gsub() with a pattern that matches a word beginning with (or containing) mir.
But I have no idea how to construct such a pattern.
Or is there another approach?
Any help will be appreciated!
One way in base R is to split each string into words and then keep only the words that contain mir:
unlist(lapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE)))
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
We can save the unlist step by using sapply instead of lapply, as suggested by @Rich Scriven in the comments. sapply simplifies the result to a character vector here because every string yields exactly one match:
sapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE))
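The simplification done by sapply depends on the data. As a minimal sketch, assuming a hypothetical vector s2 (not from the question) in which one string has no mir word and another has two, the same call returns a list rather than a character vector:

```r
# Hypothetical input: the second string has no "mir" word, the third has two.
s2 <- c("a mir-96 line", "no match here", "mir-1 and mir-2")

# Same split-and-filter approach as above.
res <- sapply(strsplit(s2, " "), function(x) grep("mir", x, value = TRUE))

is.list(res)   # TRUE: the per-string results have different lengths
lengths(res)   # 1 0 2
```

In that situation the lapply plus unlist version still gives a flat vector, but the one-to-one correspondence with the input strings is lost.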
We can use sub to match zero or more characters (.*), followed by a word boundary (\\b), followed by the string mir and one or more non-whitespace characters (\\S+), captured as a group by placing them inside (...), followed by any remaining characters (.*). In the replacement, use the backreference to the captured group (\\1):
sub(".*\\b(mir\\S+).*", "\\1", s)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
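As another idea, the matching word can be extracted directly instead of deleting everything around it. The following is a sketch using base R's regexpr and regmatches; it is not part of the answers above, perl = TRUE is a choice made here, and the vector x is a hypothetical input used to show how the two approaches differ when a string has no mir word:

```r
s <- c("a mir-96 line (kk27)", "mir-133a cell",
       "d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")

# Locate the first word starting with "mir" in each string, then pull it out.
out <- regmatches(s, regexpr("\\bmir\\S+", s, perl = TRUE))
out
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"

# Hypothetical input with a string that contains no "mir" word.
x <- c("a mir-96 line", "nothing here")
kept    <- sub(".*\\b(mir\\S+).*", "\\1", x)                     # "mir-96" "nothing here"
dropped <- regmatches(x, regexpr("\\bmir\\S+", x, perl = TRUE))  # "mir-96"
```

Note the difference: sub returns a non-matching string unchanged, while regmatches silently drops it, so the result can be shorter than the input.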
If a string contains more than one 'mir.*' substring, we can extract the one that has a numeric part:
s1 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in", "m mir133 (sas)",
"mir_23_5p r 27", "a mir_23-5p 1 mir-net")
sub(".*\\b(mir[^0-9]*[0-9]+\\S*).*", "\\1", s1)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p" "mir_23-5p"
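When a string holds several mir words, it can help to look at all candidates before choosing one. A minimal sketch, assuming the goal is to keep only candidates that contain a digit; the variable names and the use of gregexpr with perl = TRUE are choices made here, not part of the answers above:

```r
# The string from the example above that contains two "mir" words.
last_string <- "a mir_23-5p 1 mir-net"

# gregexpr finds every match, not only the first one.
all_mir <- regmatches(last_string,
                      gregexpr("\\bmir\\S+", last_string, perl = TRUE))[[1]]
all_mir
#[1] "mir_23-5p" "mir-net"

# Keep only the candidates that contain at least one digit.
with_digit <- all_mir[grepl("[0-9]", all_mir)]
with_digit
#[1] "mir_23-5p"
```

This two-step form is longer than the single sub call, but the filtering rule is explicit and easy to change.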