Extract words between numbers
I'm trying to write a regex in R that extracts the words between numbers for each string in a character vector. Unfortunately, my regex skills aren't nearly up to the challenge.
Here's an example of the problem and my initial attempt:
x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323",
"3 Anotherword wordagain Newword. 3,234 556")
m <- regexpr("[a-zA-Z]+\\s+", x, perl = TRUE)
regmatches(x, m)
This approach only produces
"Singleword ", "randword ", "Anotherword "
What I need is
"Singleword", "randword & thirdword", "Anotherword wordagain Newword."
I believe this needs a regex pattern that starts at a letter (as my current one does) and then captures everything until a number is reached.
x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323",
"3 Anotherword wordagain Newword. 3,234 556")
m <- regexpr("[a-zA-Z].(\\D)+", x, perl = TRUE)
regmatches(x, m)
[1] "Singleword " "randword & thirdword "
[3] "Anotherword wordagain Newword. "
I used https://regexr.com/ and its cheatsheet to figure out how to compose the regex.
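Note that the matches above still end in a space, while the desired output has none. A minimal sketch of one way to drop it, assuming base R's trimws() is acceptable here (the variable name words is only illustrative, and the pattern is a slightly simplified variant of the one above):

```r
x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323",
       "3 Anotherword wordagain Newword. 3,234 556")

# Start at the first letter and take every following non-digit character
m <- regexpr("[a-zA-Z]\\D+", x, perl = TRUE)

# regmatches() returns the matched text; trimws() strips the trailing space
words <- trimws(regmatches(x, m))
words
```

Keep in mind that regmatches() drops elements without a match, so the result can be shorter than x if some strings contain no letters.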
Using sub
> sub(".\\s(\\D+).*", "\\1", x)
[1] "Singleword " "randword & thirdword " "Anotherword wordagain Newword. "
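The sub() result above keeps the trailing space because \\D+ is greedy and also matches whitespace. As a sketch of one possible variation (not the original answer's pattern), a lazy capture group followed by explicit whitespace leaves the space outside the group; the name res is only illustrative:

```r
x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323",
       "3 Anotherword wordagain Newword. 3,234 556")

# ^\\d+\\s+    the leading number and the whitespace after it
# (\\D+?)      lazily capture non-digits, stopping as early as possible
# \\s+\\d.*$   whitespace, the next digit, and the rest of the string
res <- sub("^\\d+\\s+(\\D+?)\\s+\\d.*$", "\\1", x, perl = TRUE)
res
```

Strings that do not fit this number, words, number layout are returned unchanged by sub(), so they may need separate handling.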
Using str_extract
> library(stringr)
> str_extract(x, pattern = "\\D+")
[1] " Singleword " " randword & thirdword " " Anotherword wordagain Newword. "
Sample data
x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323",
"3 Anotherword wordagain Newword. 3,234 556")
Base R
# replace all digits and commas with "" (nothing),
# then trim the surrounding whitespace (thanks Markus!)
trimws( gsub( "[0-9,]", "", x ) )
[1] "Singleword" "randword & thirdword" "Anotherword wordagain Newword."
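One caveat with the gsub() approach is that it removes digits and commas everywhere, including inside the words themselves. The sketch below illustrates this with a made-up input (y is hypothetical and not part of the question) and shows one possible alternative that strips only the leading and trailing runs of digits, commas and whitespace:

```r
y <- "4 word2vec model 1,000 12"  # hypothetical input, not from the question

# Removing every digit and comma also alters the word itself
stripped_all <- trimws(gsub("[0-9,]", "", y))
stripped_all    # "wordvec model"

# Strip only the runs of digits, commas and whitespace at either end
stripped_ends <- gsub("^[0-9,\\s]+|[0-9,\\s]+$", "", y, perl = TRUE)
stripped_ends   # "word2vec model"
```

The alternative has its own limitation: a word that ends in a digit right before the trailing numbers would still lose that digit. For the sample data in the question both versions give the same result.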
stringr
library(stringr)
str_extract(x, pattern = "(?<=\\d )[^0-9]+(?= \\d)")
[1] "Singleword" "randword & thirdword" "Anotherword wordagain Newword."
To learn more about how the regex patterns in the code above (and in the other answers) work, try them at https://regex101.com/, which explains each part of a pattern.
Explanation of the last regex pattern: https://regex101.com/r/QgERuZ/2
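The same lookaround pattern can also be used without stringr. This is a sketch assuming base R's regexpr() with perl = TRUE, which switches to the PCRE engine that supports lookbehind and lookahead; the names pattern and out are only illustrative:

```r
x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323",
       "3 Anotherword wordagain Newword. 3,234 556")

# (?<=\\d )   lookbehind: the match must be preceded by a digit and a space
# [^0-9]+     one or more non-digit characters
# (?= \\d)    lookahead: the match must be followed by a space and a digit
pattern <- "(?<=\\d )[^0-9]+(?= \\d)"

out <- regmatches(x, regexpr(pattern, x, perl = TRUE))
out
```

Because the surrounding spaces are consumed by the lookarounds rather than by the match itself, no trimming step is needed.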