
How to match different groups in regex

I have the following string:

"Josua de Grave* (1643-1712)"

Everything before the * is the person's name; the first number, 1643, is his year of birth, and 1712 is the year of his death.

Following this logic, I'd like to have three match groups, one for each item. I tried:

([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})
"Josua de Grave* (1643-1712)".match(/([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})/)

but that returns nil.

Why is my logic wrong, and what should I do to get the three intended match groups?

The parentheses ( ) around the dates 1643-1712 also have to be matched by your regex pattern, so use

([a-zA-Z\s]*)\* \((\d{3,4})-(\d{3,4})\)
//               ^^                   ^^

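Unescaped parentheses denote a capture group, so escape them with \ to match them as literal characters. (The | was also dropped from the character class: inside [ ] it is a literal pipe, not alternation.)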
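To make the difference concrete, here is a minimal sketch that runs both the original attempt and the corrected pattern against the sample string. The variable names (`s`, `original`, `fixed`) are only for illustration and are not from the question.

```ruby
s = "Josua de Grave* (1643-1712)"

# Original attempt: after "* " the pattern expects digits,
# but the string has a literal "(" there, so nothing matches.
original = s.match(/([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})/)
original.nil?  # => true

# Corrected pattern: \( and \) match the literal parentheses,
# while the unescaped ( ) remain capture groups.
fixed = s.match(/([a-zA-Z\s]*)\* \((\d{3,4})-(\d{3,4})\)/)

name, birth, death = fixed.captures
name   # => "Josua de Grave"
birth  # => "1643"
death  # => "1712"
```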

While you can use a pattern, the string can also be split into its parts easily with other Ruby methods:

Using split:

s = "Josua de Grave* (1643-1712)"
name, dates = s.split('*') # => ["Josua de Grave", " (1643-1712)"]
birth, death = dates[2..-2].split('-') # => ["1643", "1712"]

Or, using scan:

*name, birth, death = s.scan(/[[:alnum:]]+/) # => ["Josua", "de", "Grave", "1643", "1712"]
name.join(' ')  # => "Josua de Grave"
birth # => "1643"
death # => "1712"

If I were using a pattern, I'd use this:

name, birth, death = /^([^*]+).+?(\d+)-(\d+)/.match(s)[1..3] # => ["Josua de Grave", "1643", "1712"]
name # => "Josua de Grave"
birth # => "1643"
death # => "1712"

/^([^*]+).+?(\d+)-(\d+)/ means:

  • ^ start at the beginning of the buffer
  • ([^*]+) capture everything up to the first *, where capturing stops
  • .+? skip the minimum until...
  • (\d+) the birth year is matched and captured
  • - match the hyphen but don't capture it
  • (\d+) the death year is matched and captured

Regexper helps explain it, as does Rubular.
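As a variation on the breakdown above, the same pattern can be written with named groups in free-spacing mode, so each part carries its own comment. This is only a sketch; the group names `name`, `birth` and `death` are my own choice, not part of the original answer.

```ruby
s = "Josua de Grave* (1643-1712)"

pattern = /
  ^(?<name>[^*]+)   # everything up to, but not including, the first asterisk
  .+?               # skip as little as possible until...
  (?<birth>\d+)     # ...the first run of digits (birth year)
  -                 # literal hyphen, matched but not captured
  (?<death>\d+)     # the second run of digits (death year)
/x

m = pattern.match(s)
name  = m[:name]   # => "Josua de Grave"
birth = m[:birth]  # => "1643"
death = m[:death]  # => "1712"
```

Named groups make the call site a little more readable than `[1..3]`, at the cost of a longer pattern.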

r = /\*\s+\(|(?<=\d)\s*-\s*|\)/

"Josua de Grave* (1643-1712)".split r
  #=> ["Josua de Grave", "1643", "1712"] 

"Sir Winston Leonard Spencer-Churchill* (1874 - 1965)".split r
  #=> ["Sir Winston Leonard Spencer-Churchill", "1874", "1965"]

The regular expression can be made self-documenting by writing it in free-spacing mode:

r = /
    \*\s+\(  # match '*' then >= 1 whitespaces then '('
    |        # or
    (?<=\d)  # match is preceded by a digit (positive lookbehind)
    \s*-\s*  # match >= 0 whitespaces then '-' then >= 0 whitespaces 
    |        # or
    \)       # match ')'
    /x       # free-spacing regex definition mode

The positive lookbehind is needed to avoid splitting hyphenated names on hyphens. (The positive lookahead (?=\d), placed after \s*-\s*, could be used instead.)
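The lookahead variant mentioned in parentheses might look like the following sketch. The name `r_lookahead` is hypothetical; the only change from the original `r` is that the hyphen must now be followed by a digit rather than preceded by one.

```ruby
# Split on "* (", on a hyphen that is followed by a digit, or on ")".
r_lookahead = /\*\s+\(|\s*-\s*(?=\d)|\)/

simple = "Josua de Grave* (1643-1712)".split(r_lookahead)
# => ["Josua de Grave", "1643", "1712"]

hyphenated = "Sir Winston Leonard Spencer-Churchill* (1874 - 1965)".split(r_lookahead)
# => ["Sir Winston Leonard Spencer-Churchill", "1874", "1965"]
```

The hyphen in "Spencer-Churchill" is followed by a letter, so the lookahead fails there and the name stays intact.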
