Regex: Extracting numbers from parentheses with multiple matches
How do I match the year in a way that works for both of the following examples?
a <- '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}'
b <- 'Þegar það gerist (1998/I) (TV)'
I have tried the following, without much success:
gsub('.+\\(([0-9]+.+\\)).?$', '\\1', a)
What I thought it did was to go until it finds a (, then capture a group of digits, then any characters until it meets a ). And if there are several matches, I want to extract the first group.
Any suggestions as to where I go wrong? I have been doing this in R.
You could use
library(stringr)
strings <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)')
years <- str_match(strings, "\\((\\d+(?: B\\.C\\.)?)")[,2]
years
# [1] "1953" "1998"
The expression here is
\(              # a literal (
(\d+            # start of capture group 1: one or more digits
(?: B\.C\.)?    # optionally followed by " B.C."
)               # end of capture group 1
Note that backslashes need to be escaped in R.
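As a side note on the escaping, here is a minimal sketch of the same idea in base R, assuming R 4.0.0 or later so that raw string literals are available. The pattern is the one above; the second input string is a made-up ASCII stand-in, and the names pattern, m and years are only for illustration.

```r
# Assumes R >= 4.0.0 (raw strings). The second string is a made-up ASCII stand-in.
strings <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}',
             'Some Title (1998/I) (TV)')

# r"[...]" is a raw string: backslashes are taken literally, no doubling needed
pattern <- r"[\((\d+(?: B\.C\.)?)]"

# Same pattern as the escaped form used with str_match()
identical(pattern, "\\((\\d+(?: B\\.C\\.)?)")
# [1] TRUE

# regexec() reports the full match plus capture groups for the first match only
m <- regmatches(strings, regexec(pattern, strings, perl = TRUE))
years <- vapply(m, function(x) x[2], character(1))
years
# [1] "1953" "1998"
```

The vapply() step picks element 2 of each result because element 1 is the full match and element 2 is capture group 1, which mirrors the [,2] indexing on the str_match() matrix.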
Your pattern contains .+ parts that match one or more characters, as many as possible. Because they are greedy, the leading .+ runs to the last ( that is followed by a digit, so at best your pattern could grab the last chunk of digits in the string rather than the first.
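To see the greedy behaviour in isolation, here is a small sketch comparing a greedy and a lazy prefix on the first example string. Both patterns are simplified illustrations, not the ones from the question or the answers, and perl = TRUE is an assumption made so that the result does not depend on the default regex engine.

```r
a <- '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}'

# Greedy prefix: .+ consumes as much as it can, then backtracks only as far
# as the LAST "(" that is followed by digits
greedy <- sub(".+\\((\\d+).*", "\\1", a, perl = TRUE)

# Lazy prefix: .*? expands one char at a time and stops at the FIRST
# "(" that is followed by digits
lazy <- sub("^.*?\\((\\d+).*", "\\1", a, perl = TRUE)

greedy
# [1] "399"
lazy
# [1] "1953"
```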
You may use
^.*?\((\d{4})(?:/[^)]*)?\).*
Replace with \\1 to keep only the 4-digit number. See the regex demo.
Details
^ - start of string
.*? - any 0+ chars, as few as possible
\\( - a (
(\\d{4}) - Group 1: four digits
(?: - start of an optional non-capturing group
/ - a /
[^)]* - any 0+ chars other than )
)? - end of the group
\\) - a ) (optional, may be omitted)
.* - the rest of the string.
a <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)', 'Johannes Passion, BWV. 245 (1725 Version) (1996) (V)')
sub("^.*?\\((\\d{4})(?:/[^)]*)?\\).*", "\\1", a)
# => [1] "1953" "1998" "1996"
Another base R solution is to match the 4 digits after (:
regmatches(a, regexpr("\\(\\K\\d{4}(?=(?:/[^)]*)?\\))", a, perl=TRUE))
# => [1] "1953" "1998" "1996"
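One difference between the two base R approaches shows up when a string has no year at all: sub() hands back the input unchanged, while regmatches() silently drops the element. A minimal sketch, using made-up input strings and perl = TRUE for both calls (an assumption, so that both patterns run on the same engine):

```r
# Hypothetical inputs: the second one has no 4-digit year in parentheses
x <- c("Some Film (1953)", "Untitled project (TV)")

pattern_sub     <- "^.*?\\((\\d{4})(?:/[^)]*)?\\).*"
pattern_extract <- "\\(\\K\\d{4}(?=(?:/[^)]*)?\\))"

# sub(): no match means no replacement, so the original string comes back
from_sub <- sub(pattern_sub, "\\1", x, perl = TRUE)
from_sub
# [1] "1953"                  "Untitled project (TV)"

# regmatches(): non-matching elements are dropped, so the result is shorter
m <- regexpr(pattern_extract, x, perl = TRUE)
from_regmatches <- regmatches(x, m)
from_regmatches
# [1] "1953"

# One way to keep the result aligned with the input: pad with NA
padded <- rep(NA_character_, length(x))
padded[m > 0] <- regmatches(x, m)
padded
# [1] "1953" NA
```

The NA-padded form is usually the safer one when the years go back into a data frame column, since its length always equals the length of the input.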
The \\(\\K\\d{4} pattern matches ( and then drops it from the match thanks to the \\K match reset operator. The (?=(?:/[^)]*)?\\)) lookahead then requires an optional / followed by 0+ chars other than ), and then a ). Note that regexpr extracts the first match only.
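If every parenthesised year is wanted rather than just the first, gregexpr() can be swapped in for regexpr(). A short sketch with the same pattern; the input string is made up for illustration:

```r
# Hypothetical input with two parenthesised years
x <- "Some Title (1953) (2004/II) (TV)"
pattern <- "\\(\\K\\d{4}(?=(?:/[^)]*)?\\))"

# regexpr(): first match only
first_only <- regmatches(x, regexpr(pattern, x, perl = TRUE))
first_only
# [1] "1953"

# gregexpr(): all matches; regmatches() then returns a list, one element per input
all_years <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
all_years
# [1] "1953" "2004"
```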