Extract 2 words in any order

Question

I would like to extract cat and dog in whichever order they appear.

string1 <- "aasdfadsf cat asdfadsf dog"
string2 <- "asfdadsfads dog asdfasdfadsf cat"

What I have now extracts cat and dog, but also the text in between:

stringr::str_extract(string1, "cat.*dog|dog.*cat")

I would like the output to be

cat dog

and

dog cat

for string1 and string2, respectively.

Answer 1

You may use sub with the following PCRE regex:

.*(?|(dog).*(cat)|(cat).*(dog)).*

See the regex demo.

Details

.* - any 0+ chars other than line break chars, as many as possible (to match all chars, add (?s) at the pattern start)
(?|(dog).*(cat)|(cat).*(dog)) - a branch reset group (?|...|...) matching either of the two alternatives:
- (dog).*(cat) - Group 1 capturing dog, then any 0+ chars as many as possible, and Group 2 capturing cat
- | - or
- (cat).*(dog) - Group 1 capturing cat, then any 0+ chars as many as possible, and Group 2 capturing dog (inside a branch reset group, group numbering restarts in each alternative from the number of the last group before the branch reset group, plus 1)
.* - any 0+ chars other than line break chars

The \\1 \\2 replacement pattern inserts the Group 1 and Group 2 values into the resulting string (so that the result is just dog or cat, a space, and then cat or dog).

See an R demo online, too:

x <- c("aasdfadsf cat asdfadsf dog", "asfdadsfads dog asdfasdfadsf cat")
sub(".*(?|(dog).*(cat)|(cat).*(dog)).*", "\\1 \\2", x, perl=TRUE)
## => [1] "cat dog" "dog cat"
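One thing worth checking before relying on this: sub leaves an element untouched when the pattern does not match it, so a string without both words comes back unchanged rather than as NA. A minimal sketch, where the second input string is a made-up example that is not in the question:

```r
# Same pattern as above; "no pets here" is a hypothetical non-matching input.
x <- c("aasdfadsf cat asdfadsf dog", "no pets here")
pattern <- ".*(?|(dog).*(cat)|(cat).*(dog)).*"

# sub() only rewrites elements where the pattern matches;
# non-matching elements are returned as they are.
res <- sub(pattern, "\\1 \\2", x, perl = TRUE)
res
## => [1] "cat dog"      "no pets here"
```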

To return NA in case of no match, use a regex that matches either the specific pattern or the whole string, and use it with gsubfn to apply custom replacement logic:

> library(gsubfn)
> gsubfn("^(?:.*((dog).*(giraffe)|(giraffe).*(dog)).*|.*)$", function(x,a,b,y,z,i) ifelse(nchar(x)>0, paste0(a,y," ",b,z), NA), x)
[1] "NA" "NA"
> gsubfn("^(?:.*((dog).*(cat)|(cat).*(dog)).*|.*)$", function(x,a,b,y,z,i) ifelse(nchar(x)>0, paste0(a,y," ",b,z), NA), x)
[1] "cat dog" "dog cat"

Here,

^ - start of string anchor
(?:.*((dog).*(cat)|(cat).*(dog)).*|.*) - a non-capturing group that matches either of the two alternatives:
- .*((dog).*(cat)|(cat).*(dog)).* - the first alternative:
  - .* - any 0+ chars as many as possible
  - ((dog).*(cat)|(cat).*(dog)) - a capturing group (Group 1) matching either of the two alternatives:
    - (dog).*(cat) - dog (Group 2, assigned to the a variable), any 0+ chars as many as possible, and then cat (Group 3, assigned to the b variable)
    - | - or
    - (cat).*(dog) - cat (Group 4, assigned to the y variable), any 0+ chars as many as possible, and then dog (Group 5, assigned to the z variable)
  - .* - any 0+ chars as many as possible
- | - or
- .* - any 0+ chars
$ - end of string anchor.

The x in the anonymous function represents the Group 1 value, which is purely "technical" here: we check with nchar whether the Group 1 match is non-empty. If it is, we build the replacement with the custom logic; if Group 1 is empty, we replace with NA. Since only one of the two inner alternatives takes part in a match, the groups of the other alternative are empty, so paste0(a,y," ",b,z) yields just the two words in the order they were found.
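Note that the gsubfn output above contains the string "NA" rather than a real missing value. If a true NA is wanted and an extra package is not, one possible alternative (a sketch that is not part of the original answer) is to test for a match with grepl first and only then apply the sub call from the earlier demo. The third input string is borrowed from the other answer as a non-matching case:

```r
x <- c("aasdfadsf cat asdfadsf dog",
       "asfdadsfads dog asdfasdfadsf cat",
       "asfdadsfads asfdadsfadf")
pattern <- ".*(?|(dog).*(cat)|(cat).*(dog)).*"

# TRUE where both words are present, in either order
matched <- grepl(pattern, x, perl = TRUE)

# Keep the two captured words where the pattern matched, NA elsewhere
result <- ifelse(matched,
                 sub(pattern, "\\1 \\2", x, perl = TRUE),
                 NA_character_)
result
## => [1] "cat dog" "dog cat" NA
```

Here is.na(result) is TRUE for the third element, which is usually easier to work with downstream than the literal text "NA".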

Answer 2

We can use str_extract_all from the stringr package with capture groups.

string1 <- "aasdfadsf cat asdfadsf dog"
string2 <- "asfdadsfads dog asdfasdfadsf cat"
string3 <- "asfdadsfads asfdadsfadf"

library(stringr)
str_extract_all(c(string1, string2, string3), pattern = "(dog)|(cat)")
# [[1]]
# [1] "cat" "dog"
# 
# [[2]]
# [1] "dog" "cat"
# 
# [[3]]
# character(0)

We can also set simplify = TRUE. The output would be a matrix.

str_extract_all(c(string1, string2, string3), pattern = "(dog)|(cat)", simplify = TRUE)
#       [,1]  [,2] 
# [1,] "cat" "dog"
# [2,] "dog" "cat"
# [3,] ""    ""

Answer 3

Or,

> regmatches(string1,gregexpr("cat|dog",string1))
[[1]]
[1] "cat" "dog"

> regmatches(string2,gregexpr("cat|dog",string2))
[[1]]
[1] "dog" "cat"
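The same base R calls also accept a whole character vector, and the per-string matches can be pasted back into the "cat dog" form the question asks for. The sketch below makes two assumptions that go beyond the answer: it adds \\b word boundaries so that cat inside a longer word is not picked up, and it uses a made-up third string to show that effect:

```r
x <- c("aasdfadsf cat asdfadsf dog",
       "asfdadsfads dog asdfasdfadsf cat",
       "concatenate dogma")   # hypothetical input: no standalone cat or dog

# One character vector of matches per input string;
# \\b restricts the matches to whole words.
words <- regmatches(x, gregexpr("\\b(cat|dog)\\b", x, perl = TRUE))

# Collapse each set of matches into a single space-separated string
out <- vapply(words, paste, character(1), collapse = " ")
out
## => [1] "cat dog" "dog cat" ""
```

Unlike the sub approach, this returns every occurrence of either word, so a string containing only cat would give "cat" rather than no result.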

3 answers

Answer 1
3 2018-02-02 21:54:32

Answer 2
2 2018-02-02 21:52:50

Answer 3
1 2018-02-02 22:00:52
