
Regex: Extracting numbers from parentheses with multiple matches

How do I match the year in a way that works for both of the following examples?

a <- '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}'
b <- 'Þegar það gerist (1998/I) (TV)'

I have tried the following, without much success.

gsub('.+\\(([0-9]+.+\\)).?$', '\\1', a)

My understanding was that it scans until it finds a (, captures a group of digits, and then matches any characters until it meets a ). If there are several matches, I want to extract the first group.

Any suggestions as to where I went wrong? I am doing this in R.

You could use

library(stringr)

strings <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)')

years <- str_match(strings, "\\((\\d+(?: B\\.C\\.)?)")[,2]
years
# [1] "1953" "1998"

The expression here is

\(               # (
(\d+             # capture 1+ digits
    (?: B\.C\.)? # optionally followed by " B.C."
)

Note that backslashes need to be doubled inside R string literals, which is why \( above is written as "\\(" in the code.
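A small sketch of the escaping rule, using only base R. The raw-string form r"(...)" assumes R 4.0.0 or later; the variable names are illustrative and not part of the answer.

```r
# In a regular string literal, "\\(" is two characters: a backslash and "(".
escaped <- "\\(\\d+"
nchar("\\(")
# [1] 2

# Raw strings (assumption: R >= 4.0.0) let the regex be written without doubling.
raw <- r"(\(\d+)"
identical(escaped, raw)
# [1] TRUE
```

Both forms hand the same five-character pattern to the regex engine.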

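Your pattern contains greedy .+ parts, each of which matches one or more characters, as many as possible. The leading .+ therefore runs to the end of the string and backtracks, so at best the pattern captures the last parenthesized chunk that starts with digits, not the first.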
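To make the greedy behaviour concrete, here is a minimal sketch that runs the pattern from the question on the first example and compares it with a lazy variant. The lazy pattern is a simplified illustration (an assumption for this comparison only), not the full solution.

```r
a <- '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}'

# Pattern from the question: the leading greedy .+ backtracks from the end,
# so the capture starts at the LAST "(" that is followed by a digit.
greedy <- gsub('.+\\(([0-9]+.+\\)).?$', '\\1', a)
greedy
# [1] "399 B.C.) (#1.14)"

# Illustrative lazy variant: .*? stops at the FIRST "(" followed by digits,
# and the group captures digits only.
lazy <- sub('^.*?\\(([0-9]+).*$', '\\1', a, perl = TRUE)
lazy
# [1] "1953"
```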

You may use

^.*?\((\d{4})(?:/[^)]*)?\).*

In R, double the backslashes in the pattern string and replace with "\\1" to keep only the four-digit number.

Details

  • ^ - start of string
  • .*? - any 0+ chars as few as possible
  • \\( - a (
  • (\\d{4}) - Group 1: four digits
  • (?: - start of an optional non-capturing group
    • / - a /
    • [^)]* - any 0+ chars other than )
  • )? - end of the group
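  • \\) - a ) (it can be omitted only when four digits right after ( are always the year; it is needed for inputs such as (1725 Version), where the digits are not followed by ) or /)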
  • .* - the rest of the string.

See the R demo:

a <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)', 'Johannes Passion, BWV. 245 (1725 Version) (1996) (V)')
sub("^.*?\\((\\d{4})(?:/[^)]*)?\\).*", "\\1", a) 
# => [1] "1953" "1998" "1996"

Another base R solution is to match the four digits after (:

regmatches(a, regexpr("\\(\\K\\d{4}(?=(?:/[^)]*)?\\))", a, perl=TRUE))
# => [1] "1953" "1998" "1996"

The \\(\\K\\d{4} pattern matches ( and then discards it from the match with the \\K match reset operator, so only the four digits are returned. The (?=(?:/[^)]*)?\\)) lookahead then requires that the digits be followed by an optional / plus zero or more characters other than ), and then a ). Note that regexpr extracts only the first match in each string.
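One caveat with this approach: regmatches() returns only the elements that matched, so the result can be shorter than the input. The sketch below wraps the same pattern in a hypothetical helper, extract_year (not part of the answer), that keeps the result aligned with the input by filling in NA. The second test title is made up for illustration.

```r
# Hypothetical helper: first parenthesized 4-digit year per string, NA if none.
extract_year <- function(x) {
  m <- regexpr("\\(\\K\\d{4}(?=(?:/[^)]*)?\\))", x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  hit <- as.integer(m) != -1L      # regexpr() returns -1 where nothing matched
  out[hit] <- regmatches(x, m)     # regmatches() yields matched elements only
  out
}

titles <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}',
            'No year here (TV)',
            'Johannes Passion, BWV. 245 (1725 Version) (1996) (V)')
years <- extract_year(titles)
years
# [1] "1953" NA     "1996"
```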
