Extract 2 words in any order
I would like to extract cat and dog in any order.
string1 <- "aasdfadsf cat asdfadsf dog"
string2 <- "asfdadsfads dog asdfasdfadsf cat"
What I have now extracts cat and dog, but also the text in between:
stringr::str_extract(string1, "cat.*dog|dog.*cat")
I would like the output to be
cat dog
and
dog cat
for string1 and string2, respectively.
You may use sub with the following PCRE regex:
.*(?|(dog).*(cat)|(cat).*(dog)).*
See the regex demo.
Details
- .* - any 0+ chars other than line break chars (to match all chars, add (?s) at the pattern start)
- (?|(dog).*(cat)|(cat).*(dog)) - a branch reset group (?|...|...) matching either of the two alternatives:
  - (dog).*(cat) - Group 1 capturing dog, then any 0+ chars as many as possible, and Group 2 capturing cat
  - | - or
  - (cat).*(dog) - Group 1 capturing cat, then any 0+ chars as many as possible, and Group 2 capturing dog (in a branch reset group, group IDs reset to the value before the group + 1)
- .* - any 0+ chars other than line break chars

The \\1 \\2 replacement pattern inserts the Group 1 and Group 2 values into the resulting string, so that the result is just dog or cat, a space, and then cat or dog.
See an R demo online, too:
x <- c("aasdfadsf cat asdfadsf dog", "asfdadsfads dog asdfasdfadsf cat")
sub(".*(?|(dog).*(cat)|(cat).*(dog)).*", "\\1 \\2", x, perl=TRUE)
## => [1] "cat dog" "dog cat"
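To illustrate what the branch reset group does here, the following sketch compares it with an ordinary non-capturing alternation. It is not part of the original answer; the variable names plain and reset are made up for the illustration. Without the branch reset, the four capturing groups are numbered 1 to 4, so the replacement has to reference all of them and rely on non-participating groups being substituted as empty strings.

```r
x <- c("aasdfadsf cat asdfadsf dog", "asfdadsfads dog asdfasdfadsf cat")

# Ordinary alternation: groups are numbered 1..4 across both branches,
# and only one pair (1,2 or 3,4) takes part in any given match.
plain <- sub(".*(?:(dog).*(cat)|(cat).*(dog)).*", "\\1\\3 \\2\\4", x, perl = TRUE)

# Branch reset: both branches reuse group numbers 1 and 2,
# so the replacement only needs \\1 and \\2.
reset <- sub(".*(?|(dog).*(cat)|(cat).*(dog)).*", "\\1 \\2", x, perl = TRUE)

print(plain)
print(reset)
```

Both calls should give the same result; the branch reset version simply keeps the replacement string shorter.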
To return NA in case of no match, use a regex that matches either the specific pattern or the whole string, and use it with gsubfn to apply custom replacement logic:
> gsubfn("^(?:.*((dog).*(giraffe)|(giraffe).*(dog)).*|.*)$", function(x,a,b,y,z,i) ifelse(nchar(x)>0, paste0(a,y," ",b,z), NA), x)
[1] "NA" "NA"
> gsubfn("^(?:.*((dog).*(cat)|(cat).*(dog)).*|.*)$", function(x,a,b,y,z,i) ifelse(nchar(x)>0, paste0(a,y," ",b,z), NA), x)
[1] "cat dog" "dog cat"
Here,
- ^ - start of the string anchor
- (?:.*((dog).*(cat)|(cat).*(dog)).*|.*) - a non-capturing group that matches either of the two alternatives:
  - .*((dog).*(cat)|(cat).*(dog)).* :
    - .* - any 0+ chars as many as possible
    - ((dog).*(cat)|(cat).*(dog)) - a capturing group (Group 1) matching either of the two alternatives:
      - (dog).*(cat) - dog (Group 2, assigned to the a variable), any 0+ chars as many as possible, and then cat (Group 3, assigned to the b variable)
      - | - or
      - (cat).*(dog) - cat (Group 4, assigned to the y variable), any 0+ chars as many as possible, and then dog (Group 5, assigned to the z variable)
    - .* - any 0+ chars as many as possible
  - | - or
  - .* - any 0+ chars
- $ - end of the string anchor.

The x in the anonymous function represents the Group 1 value, which is "technical" here: we check with nchar whether the Group 1 match is non-empty. If it is not empty, we replace with the custom logic; if Group 1 is empty, we replace with NA.
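Note that the gsubfn output above shows "NA" in quotes, that is, the string "NA" rather than a missing value. If a real NA is needed and an extra package is not wanted, one possible base R sketch (not part of the original answer) is to test for a match with grepl first and only keep the sub result where the pattern matched. The third element of x and the names pattern, extracted and result are assumptions added for the illustration.

```r
x <- c("aasdfadsf cat asdfadsf dog",
       "asfdadsfads dog asdfasdfadsf cat",
       "asfdadsfads asfdadsfadf")          # no cat or dog here

pattern <- ".*(?|(dog).*(cat)|(cat).*(dog)).*"

# sub() leaves non-matching strings unchanged, so the third element
# would come back as the full input string.
extracted <- sub(pattern, "\\1 \\2", x, perl = TRUE)

# Keep the extracted pair only where the pattern actually matched;
# otherwise return a real missing value.
result <- ifelse(grepl(pattern, x, perl = TRUE), extracted, NA_character_)
print(result)
```

Here is.na(result) is TRUE only for the element that contains neither pair.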
We can use str_extract_all from the stringr package with capture groups.
string1 <- "aasdfadsf cat asdfadsf dog"
string2 <- "asfdadsfads dog asdfasdfadsf cat"
string3 <- "asfdadsfads asfdadsfadf"
library(stringr)
str_extract_all(c(string1, string2, string3), pattern = "(dog)|(cat)")
# [[1]]
# [1] "cat" "dog"
#
# [[2]]
# [1] "dog" "cat"
#
# [[3]]
# character(0)
We can also set simplify = TRUE. The output would be a matrix.
str_extract_all(c(string1, string2, string3), pattern = "(dog)|(cat)", simplify = TRUE)
#      [,1]  [,2]
# [1,] "cat" "dog"
# [2,] "dog" "cat"
# [3,] "" ""
Or,
> regmatches(string1,gregexpr("cat|dog",string1))
[[1]]
[1] "cat" "dog"
> regmatches(string2,gregexpr("cat|dog",string2))
[[1]]
[1] "dog" "cat"
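The extract-all approaches return the individual words rather than a single "cat dog" string, and they also return a result when only one of the two words is present. A minimal base R sketch that collapses the matches into the requested output is given below. The helper name extract_pair, its words argument, and the choice to return NA unless both words occur are assumptions made for this illustration, not part of the original answers.

```r
# Hypothetical helper: extract the given words in order of first appearance
# and join them with a space; return NA unless every word is present.
extract_pair <- function(x, words = c("cat", "dog")) {
  pattern <- paste(words, collapse = "|")
  matches <- regmatches(x, gregexpr(pattern, x))
  vapply(matches, function(m) {
    if (all(words %in% m)) paste(unique(m), collapse = " ") else NA_character_
  }, character(1))
}

string1 <- "aasdfadsf cat asdfadsf dog"
string2 <- "asfdadsfads dog asdfasdfadsf cat"
string3 <- "asfdadsfads asfdadsfadf"

res <- extract_pair(c(string1, string2, string3))
print(res)
```

Because the pattern is a plain alternation, it also matches cat inside longer words such as concatenate; adding \\b word boundaries to the pattern would be one way to tighten that if it matters for the data.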