
Extracting numbers and text from string in R

I have a string and would like to extract the first three numbers (or number ranges), together with the unit of up to three letters next to each one, and put them into a vector. So this:

t1 <- "The string contains numbers ranging from 3-4 cm and can reach up to 5.6 m long, and sometimes can even reach 10 m."

would become:

"3-4 cm", "5.6 m", "10 m"

I have looked up various regular expression functions like grep, grepl etc., but can't find an example that matches my query. Any suggestions?

Here's how this can be done with gregexpr() + regmatches():

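ipartRE <- '\\d+'                                       # integer part
fpartRE <- '\\.\\d+'                                    # fractional part
numRE <- paste0(ipartRE, '(?:', fpartRE, ')?')          # number with optional fraction
rangeRE <- paste0(numRE, '(?:\\s*-\\s*', numRE, ')?')   # optional "-number" forming a range
pat <- paste0(rangeRE, '\\s*[a-zA-Z]{1,3}\\b')          # followed by a unit of 1 to 3 letters
regmatches(t1, gregexpr(pat, t1, perl = TRUE))[[1L]]
## [1] "3-4 cm" "5.6 m"  "10 m"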

I built up the regex incrementally from component parts for human readability, but obviously you don't need to do that.
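
For comparison, here is a minimal sketch of the same pattern written as a single literal and wrapped in a small function. The names pat_inline and extract_measures are made up for this illustration and are not part of base R or any package.

```r
t1 <- "The string contains numbers ranging from 3-4 cm and can reach up to 5.6 m long, and sometimes can even reach 10 m."

# Same pattern as one literal: number, optional "-number",
# optional whitespace, then a unit of 1 to 3 letters ending at a word boundary.
pat_inline <- "\\d+(?:\\.\\d+)?(?:\\s*-\\s*\\d+(?:\\.\\d+)?)?\\s*[a-zA-Z]{1,3}\\b"

# Hypothetical helper: returns a list with one character vector per input string.
extract_measures <- function(x, pattern = pat_inline) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

res <- extract_measures(t1)[[1L]]
print(res)
## [1] "3-4 cm" "5.6 m"  "10 m"
```

Because gregexpr() is vectorised over its text argument, the helper should also accept a character vector of several strings and return one element per string.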


To match the new pattern, the second number needs an alternation: it is either a plain dash followed by a number, or a dash and number enclosed in matching parentheses. I also found that the dash in 120(–150) cm is not a normal ASCII hyphen but an en dash, so I added another precomputed regular expression piece called dashRE, which matches all 3 common dash types (ASCII hyphen, en dash, and em dash):

# assumes t1 is the revised string that also contains "120(–150) cm"
ipartRE <- '\\d+'
fpartRE <- '\\.\\d+'
numRE <- paste0(ipartRE, '(?:', fpartRE, ')?')
dashRE <- '[—–-]'   # em dash, en dash, ASCII hyphen
# second number: either "dash number" or "(dash number)"
rangeOptParenRE <- paste0(numRE, '(?:\\s*(?:', dashRE, '\\s*', numRE, '|\\(\\s*', dashRE, '\\s*', numRE, '\\s*\\)\\s*))?')
pat <- paste0(rangeOptParenRE, '\\s*[a-zA-Z]{1,3}\\b')
regmatches(t1, gregexpr(pat, t1, perl = TRUE))[[1L]]
## [1] "3-4 cm"       "120(–150) cm" "5.6 m"        "10 m"
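
The revised input string is not reproduced here, so the following sketch runs the same pattern on a made-up sample string, t2, which is an assumption for illustration only. The dashes are written as \u escapes so that the script does not depend on the encoding of the source file.

```r
# Hypothetical sample text; \u2013 is an en dash, \u2014 an em dash.
t2 <- "Stems 3-4 cm wide and 120(\u2013150) cm tall, rarely 5.6 m."

numRE  <- "\\d+(?:\\.\\d+)?"
dashRE <- "[\u2014\u2013-]"   # em dash, en dash, ASCII hyphen (hyphen last, so it is literal)

# Optional second number: "dash number" or "(dash number)".
rangeOptParenRE <- paste0(
  numRE,
  "(?:\\s*(?:", dashRE, "\\s*", numRE,
  "|\\(\\s*", dashRE, "\\s*", numRE, "\\s*\\)\\s*))?"
)
pat2 <- paste0(rangeOptParenRE, "\\s*[a-zA-Z]{1,3}\\b")

res2 <- regmatches(t2, gregexpr(pat2, t2, perl = TRUE))[[1L]]
print(res2)
```

With this sample, the result should contain three matches: the plain range, the parenthesised en-dash range, and the single number with its unit.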

You can try the regular expression [0-9.-]+\\s+[a-zA-Z]{1,3} and use str_extract_all from the stringr package to extract the matches:

stringr::str_extract_all(t1, "[0-9.-]+\\s+[a-zA-Z]{1,3}")
[[1]]
[1] "3-4 cm" "5.6 m"  "10 m"
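
This shorter pattern is looser than the component-built one: it requires whitespace before the unit, has no word boundary after it, and its character class accepts a lone dash or dot. The sketch below compares the two on a made-up string, t3, using base R's gregexpr() with perl = TRUE as a stand-in for str_extract_all; the variable names are assumptions for illustration.

```r
# Hypothetical input chosen to expose the differences between the two patterns.
t3 <- "About 10m wide, 2.5 metres deep - not more."

simple_pat <- "[0-9.-]+\\s+[a-zA-Z]{1,3}"
strict_pat <- "\\d+(?:\\.\\d+)?(?:\\s*-\\s*\\d+(?:\\.\\d+)?)?\\s*[a-zA-Z]{1,3}\\b"

simple_res <- regmatches(t3, gregexpr(simple_pat, t3, perl = TRUE))[[1L]]
strict_res <- regmatches(t3, gregexpr(strict_pat, t3, perl = TRUE))[[1L]]

print(simple_res)
## [1] "2.5 met" "- not"
print(strict_res)
## [1] "10m"
```

The simple pattern misses "10m" (no space), truncates "metres" to "met", and picks up the stray "- not", while the stricter one only accepts a number followed by a whole word of at most three letters. Which behaviour is preferable depends on how regular the input text is.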
