Extract words between numbers

I'm trying to write a regex in R that extracts the words between the numbers in each string of a character vector. Unfortunately, my regex skills aren't up to the challenge.
Here's an example of the problem and my initial attempt:

x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323", 
      "3 Anotherword wordagain Newword. 3,234 556")

m <- regexpr("[a-zA-Z]+\\s+", x, perl = TRUE)

regmatches(x, m)

This approach only produces:

"Singleword ", "randword ", "Anotherword "

What I need is:

"Singleword", "randword & thirdword", "Anotherword wordagain Newword."

I believe it needs a pattern that starts at a letter (as my current one does) and then captures everything up to the next number.

x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323", 
       "3 Anotherword wordagain Newword. 3,234 556")

m <- regexpr("[a-zA-Z].(\\D)+", x, perl = TRUE)

regmatches(x, m)

[1] "Singleword "                     "randword & thirdword "
[3] "Anotherword wordagain Newword. "

I used https://regexr.com/ and its cheatsheet to figure out how to compose the regex.
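
The matches above still carry the trailing space that sits in front of the next number. One way to drop it, offered as a sketch and not as part of the original answer, is to wrap the result in base R's trimws(). The slightly simplified pattern [a-zA-Z]\\D* is an assumption of this sketch.

```r
x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323",
       "3 Anotherword wordagain Newword. 3,234 556")

# start at the first letter, then take every non-digit character that follows
m <- regexpr("[a-zA-Z]\\D*", x, perl = TRUE)

# trimws() strips the whitespace left in front of the next number
res_trim <- trimws(regmatches(x, m))
print(res_trim)
```

Trimming after the match keeps the pattern simple; the alternative is to exclude the space inside the regex itself with a lookahead.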

Using sub

> sub(".\\s(\\D+).*", "\\1", x)
[1] "Singleword "   "randword & thirdword "  "Anotherword wordagain Newword. "
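
The sub() result above keeps a trailing space because \\D+ is greedy and swallows the space before the next number. A possible variant, sketched here under the assumption that every string has the shape "number, words, number ...", makes the capture lazy and anchors the pattern at both ends:

```r
x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323",
       "3 Anotherword wordagain Newword. 3,234 556")

# ^\d+\s+    the leading number and the whitespace after it
# (\D+?)     lazily capture the words
# \s+\d.*$   the whitespace before the next number, then the rest of the string
res_sub <- sub("^\\d+\\s+(\\D+?)\\s+\\d.*$", "\\1", x, perl = TRUE)
print(res_sub)
```

Note that sub() returns a string unchanged when the pattern does not match, so inputs that deviate from the assumed shape would pass through as-is.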

Using str_extract

> library(stringr)
> str_extract(x, pattern = "\\D+")
[1] " Singleword "  " randword & thirdword "  " Anotherword wordagain Newword. "

Sample data

x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323", 
   "3 Anotherword wordagain Newword. 3,234 556")

Base R

# replace all digits and commas with "" (i.e. nothing),
# then trim the leftover whitespace (thanks Markus!)
trimws( gsub( "[0-9,]", "", x ) )

[1] "Singleword"                     "randword & thirdword"
[3] "Anotherword wordagain Newword."
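
One caveat, illustrated with a made-up string that does not appear in the question: removing every digit and comma also removes any that belong to the words themselves. If that matters, a more conservative sketch strips only the leading number and the trailing number groups.

```r
y <- "4 alpha, beta 1,000 12"   # hypothetical input with a comma inside the text

# blanket removal also deletes the comma after "alpha"
blanket <- trimws(gsub("[0-9,]", "", y))

# anchored removal: drop the leading number, then the trailing number groups
anchored <- sub("(\\s+[0-9,]+)+$", "", sub("^[0-9,]+\\s+", "", y))

print(blanket)
print(anchored)
```

For the sample data in the question both approaches give the same result, so the simpler gsub() call is sufficient there.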

stringr

library(stringr)
str_extract(x, pattern = "(?<=\\d )[^0-9]+(?= \\d)")

[1] "Singleword"                     "randword & thirdword"
[3] "Anotherword wordagain Newword."
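
If stringr is not available, the same lookaround pattern should also work with base R's regexpr() and regmatches(). This is a sketch, not part of the original answer; perl = TRUE is needed because the default regex engine does not support lookbehind or lookahead.

```r
x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323",
       "3 Anotherword wordagain Newword. 3,234 556")

# (?<=\d )  the match must be preceded by a digit and a space
# [^0-9]+   one or more non-digit characters
# (?= \d)   the match must be followed by a space and a digit
m <- regexpr("(?<=\\d )[^0-9]+(?= \\d)", x, perl = TRUE)
res_look <- regmatches(x, m)
print(res_look)
```

One difference to keep in mind: regmatches() silently drops elements without a match, whereas str_extract() returns NA for them, so the base R result can be shorter than the input.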

If you would like to learn more about how the regex patterns in the code above (and in the other answers) work, try them out and read the generated explanation at https://regex101.com/

Explanation of the last regex pattern: https://regex101.com/r/QgERuZ/2
