Remove all words from a string except the word containing a certain pattern, in R

Question

I have a vector of strings, and I would like to remove everything in each string except the word that contains a certain pattern (here, mir).

s <- c("a mir-96 line (kk27)", "mir-133a cell", 
       "d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")

I want to obtain:

mir-96, mir-133a, mir-14-3p, mir133, mir_23_5p

I know the idea: use gsub() with a pattern that matches a word beginning with (or including) mir.

But I have no idea how to construct such a pattern.

Or is there another approach?

Any help will be appreciated!

Answer 1

One way in base R is to split every string into words and then keep only the words that contain mir:

unlist(lapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE)))
#[1] "mir-96"    "mir-133a"  "mir-14-3p" "mir133"    "mir_23_5p"

As suggested by @Rich Scriven in the comments, we can skip the unlist step by using sapply instead of lapply. This simplifies to a character vector here because every string contains exactly one matching word:

sapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE))
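The sapply version returns a plain character vector only when every string yields exactly one match; otherwise it falls back to a list. As a minimal sketch of one way to guard against that, the hypothetical helper first_mir_word below (not part of the original answer) returns the first matching word, or NA when a string has none, and vapply enforces one result per input:

```r
s <- c("a mir-96 line (kk27)", "mir-133a cell",
       "d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")

# Hypothetical helper (an assumption, not from the answer): return the first
# space-separated word containing "mir", or NA when there is none.
first_mir_word <- function(x) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  hits <- grep("mir", words, value = TRUE, fixed = TRUE)
  if (length(hits) == 0L) NA_character_ else hits[1L]
}

# vapply() guarantees exactly one character value per input string
res <- vapply(s, first_mir_word, character(1), USE.NAMES = FALSE)
print(res)
```

Keeping only the first hit is a design choice made for this sketch; if several mir words per string are expected, returning a list may be more appropriate.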

Answer 2

We can use sub to match zero or more characters (.*), followed by a word boundary (\\b), followed by the string mir and one or more non-whitespace characters (\\S+). That last part is captured as a group by placing it inside (...), and it is followed by any remaining characters (.*). In the replacement we use the backreference to the captured group (\\1):

sub(".*\\b(mir\\S+).*", "\\1", s)
#[1] "mir-96"    "mir-133a"  "mir-14-3p" "mir133"    "mir_23_5p"
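One thing to keep in mind with the sub approach is that a string with no match is returned unchanged. The sketch below contrasts it with regmatches and regexpr, which extract the match directly and drop strings that do not match. The extra "no match here" element and the use of perl = TRUE are assumptions added for illustration:

```r
s <- c("a mir-96 line (kk27)", "mir-133a cell",
       "d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")
x <- c(s, "no match here")  # extra element made up for illustration

# sub() leaves a string untouched when the pattern does not match
sub_res <- sub(".*\\b(mir\\S+).*", "\\1", x)

# regexpr() finds the first match in each string; regmatches() returns
# only the matched text and skips strings without a match
rm_res <- regmatches(x, regexpr("\\bmir\\S+", x, perl = TRUE))

print(sub_res)
print(rm_res)
```

Note that rm_res is shorter than x whenever some strings do not match, so it cannot be aligned with the input by position.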

Update

If a string contains multiple 'mir.*' substrings, we want to extract the one that has a numeric part:

sub(".*\\b(mir[^0-9]*[0-9]+\\S*).*", "\\1", s1)
#[1] "mir-96"    "mir-133a"  "mir-14-3p" "mir133"    "mir_23_5p" "mir_23-5p"

Data

s1 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in", "m mir133 (sas)",
        "mir_23_5p r 27", "a mir_23-5p 1 mir-net")
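If a string might hold more than one mir word with a numeric part, sub can return only one of them. A possible sketch using gregexpr collects all of them per string. The pattern here is a variant of the one in the update: it also excludes whitespace before the digits so that a match cannot span two words, which is an assumption about the intended behaviour. The last input string is made up for illustration:

```r
s1 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in",
        "m mir133 (sas)", "mir_23_5p r 27", "a mir_23-5p 1 mir-net")
x <- c(s1, "mir-1 and mir-2b")  # extra string made up for illustration

# "mir", then any non-digit, non-space characters, then at least one digit,
# then the rest of the word
pat <- "\\bmir[^0-9\\s]*[0-9]+\\S*"

# gregexpr() finds every match; regmatches() returns a list with one
# character vector per input string
all_hits <- regmatches(x, gregexpr(pat, x, perl = TRUE))

print(all_hits[[6]])  # "mir-net" has no digit and is skipped
print(all_hits[[7]])
```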


Answer 1
2 2017-01-21 01:59:33

Answer 2
1 Accepted 2017-01-21 02:14:10
