Regex to extract numbers and trailing letter or white space

I'm currently trying to extract data from strings that are always in the same format (scraped from social sites with no API support).

Example strings:

53.2k Followers, 11 Following, 1,396 Posts
5m Followers, 83 Following, 1.1m Posts

I'm currently using the following regular expression: "[0-9]{1,5}([,.][0-9]{1,4})?" to get the numeric sections, preserving the comma and dot separators.

It yields results like:

53.2, 11, 1,396 
5, 83, 1.1

I really need a regular expression that will also grab the character after each numeric section, even if it's a white space, i.e.:

53.2k, 11 , 1,396
5m, 83 , 1.1m

Any help is greatly appreciated.

R code for reproduction:

  library(stringr)

  string1 <- ("536.2k Followers, 83 Following, 1,396 Posts")
  string2 <- ("5m Followers, 83 Following, 1.1m Posts")

  info <- str_extract_all(string1,"[0-9]{1,5}([,.][0-9]{1,4})?")
  info2 <- str_extract_all(string2,"[0-9]{1,5}([,.][0-9]{1,4})?")

  info 
  info2 

I would suggest the following regex pattern:

[0-9]{1,3}(?:,[0-9]{3})*(?:\\.[0-9]+)?[A-Za-z]*

This pattern generates the outputs you expect. Here is an explanation:

[0-9]{1,3}      match 1 to 3 initial digits
(?:,[0-9]{3})*  followed by zero or more optional thousands groups
(?:\\.[0-9]+)?  followed by an optional decimal component
[A-Za-z]*       followed by an optional text unit

I tend to lean towards base R solutions whenever possible, and here is one using gregexpr and regmatches:

txt <- "53.2k Followers, 11 Following, 1,396 Posts"
m <- gregexpr("[0-9]{1,3}(?:,[0-9]{3})*(?:\\.[0-9]+)?[A-Za-z]*", txt)
regmatches(txt, m)

[[1]]
[1] "53.2k"   "11"   "1,396"
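As a quick check, the same pattern can be run on both example strings from the question. This is only a sketch: the helper name extract_counts is hypothetical, and passing perl = TRUE is an assumption made here so that the (?:...) groups are handled by PCRE rather than the default engine.

```r
# Pattern from the answer above (backslash doubled for the R string literal).
pattern <- "[0-9]{1,3}(?:,[0-9]{3})*(?:\\.[0-9]+)?[A-Za-z]*"

# Hypothetical helper: return every match of `pattern` in one string.
# perl = TRUE is an assumption of this sketch, not part of the answer.
extract_counts <- function(x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

res1 <- extract_counts("53.2k Followers, 11 Following, 1,396 Posts")
res2 <- extract_counts("5m Followers, 83 Following, 1.1m Posts")

# res1 is c("53.2k", "11", "1,396"); res2 is c("5m", "83", "1.1m")
print(res1)
print(res2)
```

Note that this pattern keeps a trailing unit letter but never a trailing space, so "11" and "1,396" come back without one.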

We can add an optional trailing letter to the regex:

stringr::str_extract_all(string1,"[0-9]{1,5}([,.][0-9]{1,4})?[A-Za-z]?")[[1]]
#[1] "536.2k" "83"     "1,396" 
stringr::str_extract_all(string2,"[0-9]{1,5}([,.][0-9]{1,4})?[A-Za-z]?")[[1]]
#[1] "5m"   "83"   "1.1m"

(Updated my earlier post, which selected extraneous commas/spaces.)
This meets the OP's requirement to extract the trailing letter or white space after the numeric sections (without the extraneous commas and white spaces of my previous version):

(?:[\\d]+[.,]?(?=\\d*)[\\d]*[km ]?)

Previous version: \\b(?:[\\d.,]+[km\\s]?)

Explanation:
- (?:          indicates non-capturing group
- [\d]+        matches 1 or more digits
- [.,]?(?=\d*) matches 0 or 1 decimal point or comma, followed by a positive lookahead; because \d* also accepts zero digits, this lookahead always succeeds, so (?=\d) would be needed to require a following digit
- [\d]*        matches 0 or more digits
- [km ]?       matches 0 or 1 of "k", "m", or a space (the previous version used \s to allow any white space)

Applied to the example strings:

53.2k Followers, 11 Following, 1,396 Posts
5m Followers, 83 Following, 1.1m Posts
# 53.2k; 11 ; 1,396
# 5m; 83 ; 1.1m

Note the spaces matched after 11 and 83, as intended by the OP.
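The behaviour of this pattern can be checked in base R. The sketch below assumes perl = TRUE (the lookahead needs PCRE), and the helper name extract_with_trailing is hypothetical.

```r
# Pattern from this answer, written as an R string literal.
pattern <- "(?:[\\d]+[.,]?(?=\\d*)[\\d]*[km ]?)"

# Hypothetical helper: all matches of `pattern` in one string.
# perl = TRUE is assumed here because lookaheads require PCRE.
extract_with_trailing <- function(x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

res1 <- extract_with_trailing("53.2k Followers, 11 Following, 1,396 Posts")
res2 <- extract_with_trailing("5m Followers, 83 Following, 1.1m Posts")

# res1 is c("53.2k", "11 ", "1,396 "); res2 is c("5m", "83 ", "1.1m")
print(res1)
print(res2)
```

With these inputs the space after 1,396 is captured as well, since [km ]? accepts a space after any numeric section.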

Another stringr option:

new_s<-str_remove_all(unlist(str_extract_all(string2,"\\d{1,}.*\\w")),"[A-Za-z]{2,}")
strsplit(new_s," , ")

    #[[1]]
    #[1] "5m"    "83"    "1.1m "

Original:

str_remove_all(unlist(str_extract_all(string2,"\\d{1,}\\W\\w+")),"[A-Za-z]{2,}")
#[1] "83 "  "1.1m"
str_remove_all(unlist(str_extract_all(string1,"\\d{1,}\\W\\w+")),"[A-Za-z]{2,}")
#[1] "536.2k" "83 "    "1,396" 

If you also want to grab the character after the numeric section even if it is a space, you could use your pattern with an optional character class [mk ]? that includes the space:

[0-9]{1,5}(?:[,.][0-9]{1,4})?[mk ]?


You might expand the range of characters in the character class to [a-zA-Z ]? instead. If you want a quantifier that matches either one or more letters or a single space, you could use an alternation:

[0-9]{1,5}(?:[,.][0-9]{1,4})?(?:[a-zA-Z]+| )?
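Both patterns from this answer can be tried on the strings from the question. This is a minimal sketch in base R: perl = TRUE and the helper name extract_all are assumptions made here, and the trimws call is only an optional extra for anyone who later wants to drop the captured space.

```r
string1 <- "536.2k Followers, 83 Following, 1,396 Posts"
string2 <- "5m Followers, 83 Following, 1.1m Posts"

# The two patterns suggested in this answer.
pat_class <- "[0-9]{1,5}(?:[,.][0-9]{1,4})?[mk ]?"
pat_alt   <- "[0-9]{1,5}(?:[,.][0-9]{1,4})?(?:[a-zA-Z]+| )?"

# Hypothetical helper; perl = TRUE is an assumption of this sketch.
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

class1 <- extract_all(string1, pat_class)  # c("536.2k", "83 ", "1,396 ")
alt1   <- extract_all(string1, pat_alt)    # same result on this input
alt2   <- extract_all(string2, pat_alt)    # c("5m", "83 ", "1.1m")

# Optional: strip the captured trailing space again.
trimmed1 <- trimws(alt1)                   # c("536.2k", "83", "1,396")
```

On these inputs the two patterns agree; they would only differ when a number is followed by a letter other than m or k, or by more than one letter.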
