
Remove everything in a string except a needed word matching a certain pattern in R

I have a vector of strings, and I would like to remove everything from each string except the word that contains a certain pattern (here, mir).

s <- c("a mir-96 line (kk27)", "mir-133a cell", 
       "d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")

I want to obtain:

mir-96, mir-133a, mir-14-3p, mir133, mir_23_5p

I know the general idea: use gsub() with a pattern that matches a word beginning with (or containing) mir.

But I have no idea how to construct such a pattern.

Or is there another approach?

Any help will be appreciated!

One way in base R is to split every string into words and then keep only the words that contain mir:

unlist(lapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE)))
#[1] "mir-96"    "mir-133a"  "mir-14-3p" "mir133"    "mir_23_5p"

We can save the unlist step by using sapply instead of lapply, as suggested by @Rich Scriven in the comments. Note that sapply simplifies to a character vector only when every string yields exactly one match; otherwise it returns a list.

sapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE))
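To see when sapply actually simplifies, here is a small sketch that runs the same call on the original s and on a second vector in which one string has two words containing mir. The names s_multi and keep_mir are made up for this illustration and are not part of the answer above.

```r
s <- c("a mir-96 line (kk27)", "mir-133a cell",
       "d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")
# hypothetical input: the second string has two words containing "mir"
s_multi <- c("mir-133a cell", "a mir_23-5p 1 mir-net")

# helper (illustrative name): keep the words that contain "mir"
keep_mir <- function(x) grep("mir", x, value = TRUE)

res_single <- sapply(strsplit(s, " "), keep_mir)
res_multi  <- sapply(strsplit(s_multi, " "), keep_mir)

is.character(res_single)  # TRUE: one match per string, so a character vector
is.list(res_multi)        # TRUE: unequal match counts, so a list

# unlist(lapply(...)) gives a flat character vector in both cases
flat_multi <- unlist(lapply(strsplit(s_multi, " "), keep_mir))
```

So if the number of matches per string can vary, the unlist(lapply(...)) form is probably the safer of the two.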

We can use sub to match zero or more characters (.*), followed by a word boundary (\\b), followed by the string mir and one or more non-whitespace characters (\\S+). The mir part is captured as a group by placing it inside (...), and it is followed by any remaining characters (.*). In the replacement, we use the backreference to the captured group (\\1), so the whole string is replaced by just the captured word.

sub(".*\\b(mir\\S+).*", "\\1", s)
#[1] "mir-96"    "mir-133a"  "mir-14-3p" "mir133"    "mir_23_5p"
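One thing to keep in mind with the sub approach: if a string has no mir word at all, sub finds nothing to replace and returns that string unchanged. As a rough sketch of an alternative for that case, regmatches with regexpr returns only the substrings that matched. The vector x below is a made-up input, and perl = TRUE is my own choice here, not something taken from the answer.

```r
# hypothetical input: the second string has no "mir" word
x <- c("a mir-96 line (kk27)", "no match here")

# sub() leaves the non-matching string as it is
out_sub <- sub(".*\\b(mir\\S+).*", "\\1", x)
out_sub
#[1] "mir-96"        "no match here"

# regmatches() + regexpr() keep only the first match of each string
# and drop strings without a match
out_rm <- regmatches(x, regexpr("\\bmir\\S+", x, perl = TRUE))
out_rm
#[1] "mir-96"
```

Note that the regmatches result can be shorter than the input, so it does not line up element by element with x.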

Update

If a string contains multiple 'mir.*' substrings and we want to extract the one that has a numeric part, we can require at least one digit after mir (s1 is defined under Data):

sub(".*\\b(mir[^0-9]*[0-9]+\\S*).*", "\\1", s1)
#[1] "mir-96"    "mir-133a"  "mir-14-3p" "mir133"    "mir_23_5p" "mir_23-5p"
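The same "must contain a digit" idea can also be sketched with the split-and-grep approach, by keeping only the words where mir is followed somewhere by a digit. The pattern "mir.*[0-9]" is my own assumption about what "numeric part" means here, and it would return more than one word per string if several words qualified.

```r
s1 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in",
        "m mir133 (sas)", "mir_23_5p r 27", "a mir_23-5p 1 mir-net")

# split on spaces, then keep the words where "mir" is followed by a digit
res <- unlist(lapply(strsplit(s1, " "),
                     function(x) grep("mir.*[0-9]", x, value = TRUE)))
res
#[1] "mir-96"    "mir-133a"  "mir-14-3p" "mir133"    "mir_23_5p" "mir_23-5p"
```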

Data

s1 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in", "m mir133 (sas)",
        "mir_23_5p r 27", "a mir_23-5p 1 mir-net")
