
Regex to match end of line or whitespace followed by wildcard characters

I have a string from which I'm trying to match a city and state with a regular expression in Python. Some of the strings end with a country code that is preceded by a space. I'm having trouble writing a regular expression that matches all the cases and captures the city in the first capture group and the state in the second capture group.

[^.*]?Born:.*in[^.](.*),[^.*](.*)

This is the regular expression that I have so far, and these are some example strings that I'm trying to match.

  1. Born: November 8, 1961 in Chicago, Illinois
  2. Born: February 19, 1995 in Sombor, Serbia rs
  3. Born: May 19, 1976 in Greenville, South Carolina us

Based on my current regular expression this is my current output:

  1. (Chicago) (Illinois)
  2. (Sombor) (Serbia rs )
  3. (Greenville) (South Carolina us)

Expected outputs would be

  1. (Chicago) (Illinois)
  2. (Sombor) (Serbia)
  3. (Greenville) (South Carolina)

How can I account for this trailing string of a space and two characters? Any help would be greatly appreciated.

Use

Born:.*in\s+([^,]*),\s+(.*?)(?=(?:\s[A-Za-z]{2})?$)

See regex proof.
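As a rough check, the pattern above can be run against the three sample strings from the question with Python's re module. This is a minimal sketch: the helper name parse_birthplace and the call to strip() are assumptions made for illustration and are not part of the original answer. Stripping is included because a trailing space after the country code would keep the lookahead from reaching $.

```python
import re

# The pattern from the answer, compiled once for reuse.
PATTERN = re.compile(r"Born:.*in\s+([^,]*),\s+(.*?)(?=(?:\s[A-Za-z]{2})?$)")

# Sample strings taken from the question.
samples = [
    "Born: November 8, 1961 in Chicago, Illinois",
    "Born: February 19, 1995 in Sombor, Serbia rs",
    "Born: May 19, 1976 in Greenville, South Carolina us",
]


def parse_birthplace(line):
    """Return (city, state) or None. Hypothetical helper, not from the answer."""
    # Assumption: surrounding whitespace is not meaningful, so strip it first.
    match = PATTERN.search(line.strip())
    return match.groups() if match else None


results = [parse_birthplace(s) for s in samples]
for sample, result in zip(samples, results):
    print(sample, "->", result)
```

One limitation follows from the pattern itself: a state or region whose last word is exactly two letters would lose that word as well, because the pattern cannot tell it apart from a country code.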

EXPLANATION

Born: - matches the characters Born: literally (case sensitive)
.* - matches any character (except for line terminators), between zero and unlimited times, as many times as possible, giving back as needed (greedy)
in - matches the characters in literally (case sensitive)
\s+ - matches any whitespace character (equivalent to [\r\n\t\f\v  ]) between one and unlimited times, as many times as possible, giving back as needed (greedy)
1st Capturing Group ([^,]*)
  Match a single character not present in the list below [^,]* between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  , - matches the character , with index 44 in decimal (2C in hexadecimal, 54 in octal) literally (case sensitive)
, - matches the character , with index 44 in decimal (2C in hexadecimal, 54 in octal) literally (case sensitive)
\s+ - matches any whitespace character (equivalent to [\r\n\t\f\v  ]) between one and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Capturing Group (.*?)
.*? - matches any character (except for line terminators) between zero and unlimited times, as few times as possible, expanding as needed (lazy)
Positive Lookahead (?=(?:\s[A-Za-z]{2})?$)
  Assert that the Regex below matches
  Non-capturing group (?:\s[A-Za-z]{2})?
    ? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
    \s matches any whitespace character (equivalent to [\r\n\t\f\v  ])
    Match a single character present in the list below [A-Za-z]
      {2} matches the previous token exactly 2 times
      A-Z matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
      a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)
  $ asserts position at the end of a line (in Python without re.MULTILINE, this is the end of the string or just before a newline at the end of the string)
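To see why the second group has to be lazy, the sketch below compares the answer's pattern with a variant that uses a greedy (.*) instead. The greedy variant is an assumption introduced only for contrast; it is not something the answer proposes. With the greedy group, .* runs to the end of the string, the optional country-code group in the lookahead matches nothing, and $ still succeeds, so the country code stays inside the capture.

```python
import re

line = "Born: February 19, 1995 in Sombor, Serbia rs"

# Shared lookahead: an optional " xx" country code, then end of string.
tail = r"(?=(?:\s[A-Za-z]{2})?$)"

# Lazy second group, as in the answer.
lazy = re.search(r"Born:.*in\s+([^,]*),\s+(.*?)" + tail, line)
# Greedy second group, shown only for comparison.
greedy = re.search(r"Born:.*in\s+([^,]*),\s+(.*)" + tail, line)

lazy_state = lazy.group(2)
greedy_state = greedy.group(2)

print("lazy:  ", repr(lazy_state))
print("greedy:", repr(greedy_state))
```

The lazy group grows one character at a time and stops at the first position where the lookahead holds, which is right before the space and the two-letter code.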
