I would like to extract cat and dog, in whichever order they appear:
string1 <- "aasdfadsf cat asdfadsf dog"
string2 <- "asfdadsfads dog asdfasdfadsf cat"
What I have now extracts cat and dog, but also the text in between:
stringr::str_extract(string1, "cat.*dog|dog.*cat")
I would like the output to be
cat dog
and
dog cat
for string1 and string2, respectively
You may use sub with the following PCRE regex:
.*(?|(dog).*(cat)|(cat).*(dog)).*
See the regex demo.
Details
.* - any 0+ chars other than line break chars, as many as possible (to match all chars, add (?s) at the pattern start)
(?|(dog).*(cat)|(cat).*(dog)) - a branch reset group (?|...|...) matching either of the two alternatives:
    (dog).*(cat) - Group 1 capturing dog, then any 0+ chars as many as possible, and Group 2 capturing cat
    | - or
    (cat).*(dog) - Group 1 capturing cat, then any 0+ chars as many as possible, and Group 2 capturing dog (in a branch reset group, group IDs reset to the value before the group + 1, so both alternatives fill Groups 1 and 2)
.* - any 0+ chars other than line break chars

The \\1 \\2 replacement pattern inserts the Group 1 and Group 2 values into the resulting string (so that the result is just dog or cat, a space, and then cat or dog).
See an R demo online, too:
x <- c("aasdfadsf cat asdfadsf dog", "asfdadsfads dog asdfasdfadsf cat")
sub(".*(?|(dog).*(cat)|(cat).*(dog)).*", "\\1 \\2", x, perl=TRUE)
## => [1] "cat dog" "dog cat"
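One thing the demo above does not show is what happens when a string contains only one of the words, or neither. As a minimal sketch (the second input string is made up for illustration), sub leaves an element untouched when the pattern does not match it:

```r
# The first string is from the question; the second is a hypothetical
# input that contains neither "cat" nor "dog".
x <- c("aasdfadsf cat asdfadsf dog", "no pets here")
pattern <- ".*(?|(dog).*(cat)|(cat).*(dog)).*"

# sub() only rewrites elements where the pattern matches;
# non-matching elements are returned as they are.
res <- sub(pattern, "\\1 \\2", x, perl = TRUE)
print(res)
## => [1] "cat dog"      "no pets here"
```

So the result of this call has to be checked separately if unmatched strings should be treated differently.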
To return NA in case of no match, use a regex that matches either the specific pattern or the whole string, and pass it to gsubfn (from the gsubfn package) to apply custom replacement logic:
> library(gsubfn)
> gsubfn("^(?:.*((dog).*(giraffe)|(giraffe).*(dog)).*|.*)$", function(x,a,b,y,z,i) ifelse(nchar(x)>0, paste0(a,y," ",b,z), NA), x)
[1] "NA" "NA"
> gsubfn("^(?:.*((dog).*(cat)|(cat).*(dog)).*|.*)$", function(x,a,b,y,z,i) ifelse(nchar(x)>0, paste0(a,y," ",b,z), NA), x)
[1] "cat dog" "dog cat"
Here,
^ - start of the string anchor
(?:.*((dog).*(cat)|(cat).*(dog)).*|.*) - a non-capturing group that matches either of the two alternatives:
    .*((dog).*(cat)|(cat).*(dog)).* :
        .* - any 0+ chars as many as possible
        ((dog).*(cat)|(cat).*(dog)) - a capturing group (Group 1) matching either of the two alternatives:
            (dog).*(cat) - dog (Group 2, assigned to the a variable), any 0+ chars as many as possible, and then cat (Group 3, assigned to the b variable)
            | - or
            (cat).*(dog) - cat (Group 4, assigned to the y variable), any 0+ chars as many as possible, and then dog (Group 5, assigned to the z variable)
        .* - any 0+ chars as many as possible
    | - or
    .* - any 0+ chars
$ - end of the string anchor.

The x in the anonymous function represents the Group 1 value, which is "technical" here: we check with nchar whether the Group 1 match length is non-zero. If Group 1 is not empty, we replace with the custom logic; if it is empty, we replace with NA.
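If adding a package is not an option, a similar result can probably be had in base R by testing for a match first and only then substituting. This is a sketch of a different technique, not the gsubfn approach above; the third input string is a made-up non-matching example. Unlike the "NA" strings printed above, it yields a real missing value (NA_character_):

```r
x <- c("aasdfadsf cat asdfadsf dog",
       "asfdadsfads dog asdfasdfadsf cat",
       "no pets here")   # hypothetical input with no match
pattern <- ".*(?|(dog).*(cat)|(cat).*(dog)).*"

# TRUE for elements that contain both words, in either order
matched <- grepl(pattern, x, perl = TRUE)

# Keep the substitution where the pattern matched, NA elsewhere
res <- ifelse(matched, sub(pattern, "\\1 \\2", x, perl = TRUE), NA_character_)
print(res)
## => [1] "cat dog" "dog cat" NA
```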
We can use str_extract_all from the stringr package with capture groups.
string1 <- "aasdfadsf cat asdfadsf dog"
string2 <- "asfdadsfads dog asdfasdfadsf cat"
string3 <- "asfdadsfads asfdadsfadf"
library(stringr)
str_extract_all(c(string1, string2, string3), pattern = "(dog)|(cat)")
# [[1]]
# [1] "cat" "dog"
#
# [[2]]
# [1] "dog" "cat"
#
# [[3]]
# character(0)
We can also set simplify = TRUE. The output would then be a matrix.
str_extract_all(c(string1, string2, string3), pattern = "(dog)|(cat)", simplify = TRUE)
# [,1] [,2]
# [1,] "cat" "dog"
# [2,] "dog" "cat"
# [3,] "" ""
Or, in base R:
> regmatches(string1,gregexpr("cat|dog",string1))
[[1]]
[1] "cat" "dog"
> regmatches(string2,gregexpr("cat|dog",string2))
[[1]]
[1] "dog" "cat"
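The calls above return one list of matches per string. To get the single "cat dog" / "dog cat" strings asked for in the question, the matches can be pasted back together. This is a minimal sketch; the variable names are illustrative, and it assumes each word occurs at most once per string (repeated words would all appear in the output):

```r
strings <- c("aasdfadsf cat asdfadsf dog",
             "asfdadsfads dog asdfasdfadsf cat",
             "asfdadsfads asfdadsfadf")

# One character vector of matches per input string, in order of appearance
hits <- regmatches(strings, gregexpr("cat|dog", strings))

# Join each vector with a space; strings without any match become NA
collapsed <- vapply(
  hits,
  function(m) if (length(m) > 0) paste(m, collapse = " ") else NA_character_,
  character(1)
)
print(collapsed)
## => [1] "cat dog" "dog cat" NA
```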