
Regex pattern with a lookbehind doesn't work: look-behind requires fixed-width pattern

I need a regex that extracts the city name from the strings below. Each string lists the restaurant name, address, city, phone number and cuisine type, in that order.

  • Chinois on Main 2709 Main St. Santa Monica 310-392-9025 Pacific New Wave
  • Benita's Frites 1433 Third St. Promenade Santa Monica 310-458-2889 Fast Food
  • Indo Cafe 10428 1/2 National Blvd. LA 310-815-1290 Indonesian
  • Diaghilev 1020 N. San Vicente Blvd. W. Hollywood 310-854-1111 Russian
  • Jody Maroni's Sausage Kingdom 2011 Ocean Front Walk Venice 310-306-1995 Hot Dogs

I tried this regex, but it doesn't work:

zagat['city'] = zagat['raw'].str.extract("""
    ((?<=Ave.|Rd.|St.|Blvd.|Dr.|Way.|Pl.|Ln.|Ct.|Beach|Way ).+(?=...-...-....))
    """, expand=True)

Can you help?

You may use

rx = r'(?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk)\s*(.+?)\s*\d{3}-\d{3}-\d{4}'
zagat['city'] = zagat['raw'].str.extract(rx, expand=False)
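For context on why the original attempt fails: pandas compiles the pattern with Python's re module, and re only accepts a lookbehind whose alternatives all match the same number of characters. Here "Ave." is four characters and "Rd." is three, so the pattern is rejected before any matching happens. A minimal sketch that reproduces this with the standard library alone (the variable names are only illustrative):

```python
import re

# The lookbehind from the question. Its alternatives differ in length
# ("Ave." is 4 characters, "Rd." is 3), which Python's re does not allow.
lookbehind = r"(?<=Ave.|Rd.|St.|Blvd.|Dr.|Way.|Pl.|Ln.|Ct.|Beach|Way ).+(?=...-...-....)"

try:
    re.compile(lookbehind)
    error_message = None
except re.error as exc:
    # Compilation fails, so no row is ever matched.
    error_message = str(exc)

print(error_message)
```

Two smaller issues in the original pattern are worth noting as well: the unescaped . matches any character rather than a literal dot, and the newlines and indentation inside the triple-quoted string are part of the pattern unless re.VERBOSE is passed.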


Details

  • (?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk) - Ave, Rd, St, Blvd, Dr, Way, Pl, Ln or Ct followed by a literal ., or Beach, Way or Walk
  • \s* - zero or more whitespace characters
  • (.+?) - Group 1 (the value returned by .extract): one or more characters other than line break characters, as few as possible
  • \s* - zero or more whitespace characters
  • \d{3}-\d{3}-\d{4} - 3 digits, -, 3 digits, - and 4 digits
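To check the pattern against the sample rows from the question, here is a small sketch using plain re.search, which should give the same Group 1 value as .str.extract(rx, expand=False) does for a single capture group. The rows list and the extract_city helper are made up for illustration and are not part of pandas.

```python
import re

rx = r'(?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk)\s*(.+?)\s*\d{3}-\d{3}-\d{4}'

# Sample rows copied from the question.
rows = [
    "Chinois on Main 2709 Main St. Santa Monica 310-392-9025 Pacific New Wave",
    "Benita's Frites 1433 Third St. Promenade Santa Monica 310-458-2889 Fast Food",
    "Indo Cafe 10428 1/2 National Blvd. LA 310-815-1290 Indonesian",
    "Diaghilev 1020 N. San Vicente Blvd. W. Hollywood 310-854-1111 Russian",
    "Jody Maroni's Sausage Kingdom 2011 Ocean Front Walk Venice 310-306-1995 Hot Dogs",
]


def extract_city(raw):
    """Return Group 1 of the first match, or None when nothing matches."""
    match = re.search(rx, raw)
    return match.group(1) if match else None


cities = [extract_city(row) for row in rows]
for city in cities:
    print(city)
```

One caveat: for the Benita's Frites row the capture is "Promenade Santa Monica", because the match starts right after the first street suffix (St.). Address parts that follow the suffix end up in the group and would need separate handling.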
