Sub everything except a specific regex from a list in R

Question

I want to substitute everything in a list that does NOT match a given pattern. I am using R version 3.1.3 (2015-03-09) -- "Smooth Sidewalk".

The example list I have is:

y <- c("D CCNA_01234 This is example 1 bis", "D CCNA_02345 This is example 2", "D CCNA_12345 This is example 3", "D CCNA_23468 This is example 4")

and the pattern I want to match is CCNA_01234, where the digits differ in each case but there are always 5 of them.

The desired output is:

"CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

So far I have removed the part before the match with:

y_begin_rm <- sub("D ", "", y)

but I have trouble recognizing the match with the [^match] expression:

y_CCNA_numbers <- sub("[^CCNA_[0-9][0-9][0-9][0-9][0-9]]*$", "", y_begin_rm)

which produces the output:

[1] "CCNA_01234 This is example 1 bis" "CCNA_02345 This is example 2"
[3] "CCNA_12345 This is example 3" "CCNA_23468 This is example 4"

It seems that the digits specified in the pattern are matched anywhere in the string rather than in the exact combination I want, so the number after the phrase "This is example " causes a lot of trouble. When I omit the digits, or include only a digit that appears right after the CCNA_ string, it works just fine:

y_CCNA <- sub("[^CCNA_]*$", "", y_begin_rm)

results in

[1] "CCNA_" "CCNA_" "CCNA_" "CCNA_"

or

y_CCNA_0 <- sub("[^CCNA_0]*$", "", y_begin_rm[1])

results in

[1] "CCNA_0"

Is there a way to specify the exact pattern I am looking for (CCNA_[0-9][0-9][0-9][0-9][0-9])? Also, is it possible to do this in a single step (removing the text before and after the match with a single regular expression)?

Thanks in advance!

Answer 1

With base R you can do this directly from your original vector y:

sub(".*(CCNA_\\d+).*", "\\1", y)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"
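The attempt in the question struggles because `[^...]` is a bracket expression: it negates a set of single characters, not a sequence of them. The base R sketch below contrasts that with a capture group; the string `x` and the "no code here" input are made up for illustration.

```r
# Hypothetical input, modelled on the question's data
x <- "D CCNA_01234 This is example 1 bis"

# [^CNA_] means "any single character other than C, N, A or _".
# Order and repetition inside the brackets are irrelevant, so removing
# everything that matches it keeps only those four characters.
only_set_chars <- gsub("[^CNA_]", "", x)
print(only_set_chars)

# A capture group (...) matches its contents as a sequence, and \\1 in the
# replacement refers back to whatever the group matched.
code <- sub(".*(CCNA_\\d{5}).*", "\\1", x)
print(code)

# When the pattern does not match at all, sub() returns the input unchanged.
unchanged <- sub(".*(CCNA_\\d{5}).*", "\\1", "no code here")
print(unchanged)
```

The last call is worth keeping in mind: with the sub approach, a string without a CCNA code comes back as is rather than as a missing value.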

Another option is to use stringi:

library(stringi)
stri_extract_first_regex(y, "CCNA_\\d+")
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

If you have more than one CCNA pattern in each string, use stri_extract_all_regex instead.

If you want to match exactly 5 digits after CCNA_, you could also do:

stri_extract_first_regex(y, "CCNA_\\d{5}")
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"
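The difference between `\\d+` and `\\d{5}` only shows up when a code has more than five digits, which the question's data never does. A minimal base R sketch of that case follows, using regexpr and regmatches instead of stringi so that it needs no packages; the string `z` is a made-up example, and the lookahead variant is one possible way to reject longer digit runs, not something the answer itself proposes.

```r
# Hypothetical string with seven digits after the prefix
z <- "D CCNA_0123456 extra"

# regexpr() locates the first match; regmatches() extracts it.
# \\d+ takes the whole run of digits.
greedy <- regmatches(z, regexpr("CCNA_\\d+", z))
print(greedy)

# \\d{5} takes exactly five digits, silently truncating a longer run.
exactly <- regmatches(z, regexpr("CCNA_\\d{5}", z))
print(exactly)

# A negative lookahead (PCRE, hence perl = TRUE) refuses a match when a
# sixth digit follows, so nothing is extracted here.
strict <- regmatches(z, regexpr("CCNA_\\d{5}(?!\\d)", z, perl = TRUE))
print(length(strict))
```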

And of course similarly with stringr:

library(stringr)
str_extract(y, "CCNA_\\d{5}")
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

Answer 2

Here are a few ways:

1) strapplyc. This uses a particularly simple pattern. It makes use of strapplyc in the gsubfn package:

library(gsubfn)
strapplyc(y, "CCNA_\\d{5}", simplify = TRUE)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

Here is the regular expression itself, without R's string escaping:

CCNA_\d{5}

[Figure: regular expression visualization (Debuggex Demo)]

1a) If CCNA_ only ever occurs immediately before 5 digits, then we can simplify the previous solution slightly like this:

strapplyc(y, "CCNA_.{5}", simplify = TRUE)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

2) sub. The pattern here is slightly more complicated, but using sub we can do it without any add-on packages:

sub(".*(CCNA_\\d{5}).*", "\\1", y)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

3) strsplit. If the portion wanted is always the second "word" (which is the case in the question), then this would work, and again it requires no packages:

sapply(strsplit(y, " "), "[", 2)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

4) substr. If the desired portion is always characters 3 through 12, as it is in the question, then we could use substr or substring, again without any packages:

substr(y, 3, 12)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"
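Approaches 3 and 4 depend on position rather than on the pattern, so they return something even when the input layout changes. One way to guard against that, sketched here in base R, is to validate the extracted values with an anchored pattern; the `shifted` string is a hypothetical malformed input, and vapply is used in place of sapply only to fix the return type.

```r
y <- c("D CCNA_01234 This is example 1 bis", "D CCNA_02345 This is example 2",
       "D CCNA_12345 This is example 3", "D CCNA_23468 This is example 4")

# Position-based extraction, as in 4) and 3)
by_position <- substr(y, 3, 12)
by_word <- vapply(strsplit(y, " ", fixed = TRUE),
                  function(parts) parts[2], character(1))

# ^ and $ anchor the pattern so the whole value must be CCNA_ plus 5 digits
ok <- grepl("^CCNA_\\d{5}$", by_position)
print(ok)

# Hypothetical input with a two-letter prefix: the fixed offsets now cut
# the wrong characters, and the anchored check flags it.
shifted <- substr("DD CCNA_01234 This is example 5", 3, 12)
print(shifted)
print(grepl("^CCNA_\\d{5}$", shifted))
```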

Answer 3

Here's an approach using a package I maintain, qdapRegex (I prefer this or stringi/stringr to base R for consistency and ease of use). I also show a base approach. In any event, I'd look at this more as an "extraction" problem than a "sub everything but" subbing problem.

y <- c("D CCNA_01234 This is example 1 bis", "D CCNA_02345 This is example 2", 
    "D CCNA_12345 This is example 3", "D CCNA_23468 This is example 4")

library(qdapRegex)
unlist(rm_default(y, pattern = "CCNA_\\d{5}", extract = TRUE))

## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

In base R:

unlist(regmatches(y, gregexpr("CCNA_\\d{5}", y)))

## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"
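One thing to watch with the base R version: regmatches with gregexpr returns a list with one element per input string, and unlist flattens it. That is harmless when every string contains exactly one match, as in the question, but a rough sketch with a made-up vector `w` shows how the result stops lining up with the input otherwise.

```r
# Hypothetical input: one match, no match, two matches
w <- c("D CCNA_01234 first", "no code in this one", "CCNA_11111 and CCNA_22222")

m <- regmatches(w, gregexpr("CCNA_\\d{5}", w))

# Number of matches found in each input string
counts <- vapply(m, length, integer(1))
print(counts)

# unlist() gives three values for three inputs, but they no longer
# correspond position by position.
flat <- unlist(m)
print(flat)

# Keeping one value per input: first match, or NA when there is none
first_only <- vapply(m, function(v) if (length(v) > 0) v[1] else NA_character_,
                     character(1))
print(first_only)
```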

Answer 1 (score 5, 2015-05-03 13:03:26)
Answer 2 (score 5, accepted, 2015-05-03 13:15:49)
Answer 3 (score 4, 2015-05-03 13:14:48)
