I'd like to extract text between two strings for all occurrences of a pattern. For example, I have this string:
x<- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\n"
I'd like to extract the words ATLANTA
and LAS VEGAS
as such:
[1] "ATLANTA" "LAS VEGAS"
I tried using gsub(".*CITY:\\\\s|\\n","",x)
. The output this yields is:
[1] " LAS VEGAS"
I would like to output both cities (some patterns in the data include more than 2 cities) and to output them without the leading space.
I also tried the qdapRegex package but could not get close. I am not that good with regular expressions so help would be much appreciated.
Another option:
library(stringr)
str_extract_all(x, "(?<=CITY:\\s{3}).+(?=\\n)")
[[1]]
[1] "ATLANTA" "LAS VEGAS"
reads as: extract anything preceded by "City: " (and three spaces) and followed by "\\n"
You may use
> unlist(regmatches(x, gregexpr("CITY:\\s*\\K.*", x, perl=TRUE)))
[1] "ATLANTA" "LAS VEGAS"
Here, CITY:\\s*\\K.*
regex matches
CITY:
- a literal substring CITY:
\\s*
- 0+ whitespaces \\K
- match reset operator that discards the text matched so far (zeros the current match memory buffer) .*
- any 0+ chars other than line break chars, as many as possible. See the regex demo online .
Note that since it is a PCRE regex, perl=TRUE
is indispensible.
An option can be as:
regmatches(x,gregexpr("(?<=CITY:).*(?=\n\n)",x,perl = TRUE))
# [[1]]
# [1] " ATLANTA" " LAS VEGAS"
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.