
Extract a string of words between two specific words in R

I have the following string: "PRODUCT colgate good but not goodOKAY"

I want to extract all the words between PRODUCT and OKAY.

This can be done with sub:

s <- "PRODUCT colgate good but not goodOKAY"
sub(".*PRODUCT *(.*?) *OKAY.*", "\\1", s)

giving:

[1] "colgate good but not good"

No packages are needed.
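One thing to keep in mind with the sub approach: when the pattern does not match, sub returns the input unchanged rather than a missing value. The sketch below shows a possible way around that with regexec and regmatches. The helper name extract_between is made up for illustration, perl = TRUE is a choice made here, and the helper assumes the two boundary words contain no regex metacharacters.

```r
# Hypothetical helper (not from the answer above): returns NA when the
# boundaries are missing instead of handing the input back unchanged.
# Assumes `left` and `right` contain no regex metacharacters.
extract_between <- function(x, left, right) {
  pattern <- paste0(left, " *(.*?) *", right)
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each hit is c(full match, captured group), or character(0) on no match
  vapply(hits,
         function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

s <- c("PRODUCT colgate good but not goodOKAY", "no markers here")

# sub leaves the second element as it was
via_sub <- sub(".*PRODUCT *(.*?) *OKAY.*", "\\1", s)

# the helper marks it as missing
via_helper <- extract_between(s, "PRODUCT", "OKAY")
```

Whether NA is the right result for a non-matching string depends on the data; the point is only that the two approaches differ there.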

Here is a visualization of the regular expression:

.*PRODUCT *(.*?) *OKAY.*

[Figure: regular expression visualization (Debuggex Demo)]
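To make the role of the lazy group (.*?) in that pattern more concrete, here is a small sketch on a made-up input that contains OKAY twice. The input string is hypothetical, and perl = TRUE is an assumption added here so that matching follows PCRE rules.

```r
# Hypothetical input with two OKAY markers
s2 <- "PRODUCT a OKAY b OKAY"

# Lazy group: the capture stops at the first OKAY
lazy <- sub(".*PRODUCT *(.*?) *OKAY.*", "\\1", s2, perl = TRUE)

# Greedy group: the capture runs up to the last OKAY
greedy <- sub(".*PRODUCT *(.*) *OKAY.*", "\\1", s2, perl = TRUE)

lazy
trimws(greedy)
```

With a single OKAY in the string, as in the question, both variants capture the same words.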

x = "PRODUCT colgate good but not goodOKAY"
library(stringr)
str_extract(string = x, pattern = "(?<=PRODUCT).*(?=OKAY)")

(?<=PRODUCT) -- lookbehind: the match must be preceded by PRODUCT.

.* -- matches any characters except newlines.

(?=OKAY) -- lookahead: the match must be followed by OKAY.

I should add that you don't need the stringr package for this; the base functions sub and gsub work fine. I use stringr for its consistency of syntax: whether I'm extracting, replacing, detecting, etc., the function names are predictable and understandable, and the arguments are in a consistent order. I use stringr because it saves me from needing the documentation every time.

(Note that for stringr versions less than 1.1.0, you need to specify perl-flavored regex to get lookahead and lookbehind functionality, so the pattern above would need to be wrapped in perl().)
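For comparison, the same lookaround pattern can be tried in base R with regexpr and regmatches, provided perl = TRUE is set. This is a sketch rather than part of the answer above. It also shows that the lookbehind leaves the space after PRODUCT inside the match, which trimws can remove.

```r
x <- "PRODUCT colgate good but not goodOKAY"

# perl = TRUE is needed for lookbehind/lookahead in base R
m <- regmatches(x, regexpr("(?<=PRODUCT).*(?=OKAY)", x, perl = TRUE))

m          # still carries the space that follows PRODUCT
trimws(m)  # leading/trailing whitespace removed
```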

You can use gsub:

vec <- "PRODUCT colgate good but not goodOKAY"

gsub(".*PRODUCT\\s*|OKAY.*", "", vec)
# [1] "colgate good but not good"
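The pattern above strips whitespace after PRODUCT but not before OKAY. If the input may have a space in front of OKAY, a possible variation adds \\s* to the second alternative as well. The inputs below are made up for illustration, and perl = TRUE is an assumption added here.

```r
# Hypothetical inputs: one with a space before OKAY, one without
vec2 <- c("PRODUCT foo OKAY", "PRODUCT bar bazOKAY")

# Original pattern: keeps the space in front of OKAY
kept <- gsub(".*PRODUCT\\s*|OKAY.*", "", vec2, perl = TRUE)

# Variation: also drops whitespace in front of OKAY
trimmed <- gsub(".*PRODUCT\\s*|\\s*OKAY.*", "", vec2, perl = TRUE)

kept
trimmed
```

gsub is vectorised, so both calls handle every element of the character vector in one go.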

You could use the rm_between function from the qdapRegex package. It takes a string and a left and right boundary as follows:

x <- "PRODUCT colgate good but not goodOKAY"

library(qdapRegex)
rm_between(x, "PRODUCT", "OKAY", extract=TRUE)

## [[1]]
## [1] "colgate good but not good"

You could use the package unglue:

library(unglue)
x <- "PRODUCT colgate good but not goodOKAY"
unglue_vec(x, "PRODUCT {out}OKAY")
#> [1] "colgate good but not good"
