How do I extract text between two characters in R?
I'd like to extract text between two strings for all occurrences of a pattern. For example, I have this string:
x<- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\n"
I'd like to extract the words ATLANTA and LAS VEGAS as such:
[1] "ATLANTA" "LAS VEGAS"
I tried using gsub(".*CITY:\\s|\n","",x). The output this yields is:
[1] " LAS VEGAS"
I would like to output both cities (some patterns in the data include more than 2 cities) and to output them without the leading space.
I also tried the qdapRegex package but could not get close. I am not that good with regular expressions, so help would be much appreciated.
Another option:
library(stringr)
str_extract_all(x, "(?<=CITY:\\s{3}).+(?=\\n)")
[[1]]
[1] "ATLANTA" "LAS VEGAS"
This reads as: extract anything preceded by "CITY:" plus exactly three whitespace characters and followed by a newline ("\n"). The lookbehind only matches when the data has exactly that spacing after the colon.
You may use
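If the amount of whitespace after the colon varies, a fixed-width lookbehind is fragile. The following is a minimal base R sketch of one alternative, not the answerer's method: it first grabs each whole "CITY:" line and then strips the label. The sample string (with deliberately uneven spacing) and the variable names hits and cities are assumptions made for illustration.

```r
# Sample input with uneven spacing after "CITY:" (assumed for illustration)
x <- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY:   LAS VEGAS\n\n"

# Step 1: grab every run that starts with "CITY:" and stops at a newline
hits <- unlist(regmatches(x, gregexpr("CITY:[^\n]*", x)))

# Step 2: drop the label and any whitespace after it, then trim the ends
cities <- trimws(sub("^CITY:\\s*", "", hits))

cities
# [1] "ATLANTA"   "LAS VEGAS"
```

Splitting the work into "find the line" and "strip the label" avoids lookarounds entirely, so it does not need perl = TRUE.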
> unlist(regmatches(x, gregexpr("CITY:\\s*\\K.*", x, perl=TRUE)))
[1] "ATLANTA" "LAS VEGAS"
Here, the CITY:\s*\K.* regex matches:
CITY: - a literal substring CITY:
\s* - 0+ whitespaces
\K - match reset operator that discards the text matched so far (zeros the current match memory buffer)
.* - any 0+ chars other than line break chars, as many as possible.
Note that since it is a PCRE regex, perl=TRUE is indispensable.
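Since the question mentions that some inputs contain more than two cities, here is a small sketch that applies the same \K pattern to a character vector and returns one result per element. The function name extract_cities and the second sample record are hypothetical, introduced only for this example.

```r
# Hypothetical helper: one character vector of cities per input string
extract_cities <- function(x) {
  # \K needs PCRE, hence perl = TRUE
  m <- gregexpr("CITY:\\s*\\K.*", x, perl = TRUE)
  # trimws() guards against trailing spaces picked up by .*
  lapply(regmatches(x, m), trimws)
}

records <- c(
  "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\n",
  "\nCITY: BOSTON\n\nCITY: DENVER\nCITY: MIAMI\n"  # made-up second record
)

res <- extract_cities(records)
res[[1]]
# [1] "ATLANTA"   "LAS VEGAS"
res[[2]]
# [1] "BOSTON" "DENVER" "MIAMI"
```

Keeping the result as a list preserves which cities came from which input; wrap the call in unlist() if a flat vector is enough.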
Another option is:
regmatches(x,gregexpr("(?<=CITY:).*(?=\n\n)",x,perl = TRUE))
# [[1]]
# [1] " ATLANTA" " LAS VEGAS"
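The lookbehind above keeps the space that follows the colon, while the question asks for results without it. One possible way to get there, sketched here as an assumption rather than as part of the original answer, is to pass the matches through trimws(). The lookahead is dropped in this sketch because .* already stops at a line break under PCRE.

```r
x <- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\n"

# Everything after "CITY:" up to the end of the line, leading space included
raw <- unlist(regmatches(x, gregexpr("(?<=CITY:).*", x, perl = TRUE)))

# Strip leading and trailing whitespace from each match
clean <- trimws(raw)

clean
# [1] "ATLANTA"   "LAS VEGAS"
```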