I need a regex that extracts the city name from the strings below. Each string lists, in order, the restaurant name, address, city, phone number, and cuisine type:
Chinois on Main 2709 Main St. Santa Monica 310-392-9025 Pacific New Wave
Benita's Frites 1433 Third St. Promenade Santa Monica 310-458-2889 Fast Food
Indo Cafe 10428 1/2 National Blvd. LA 310-815-1290 Indonesian
Diaghilev 1020 N. San Vicente Blvd. W. Hollywood 310-854-1111 Russian
Jody Maroni's Sausage Kingdom 2011 Ocean Front Walk Venice 310-306-1995 Hot Dogs
I tried this regex, but it doesn't work:
zagat['city'] = zagat['raw'].str.extract("""
((?<=Ave.|Rd.|St.|Blvd.|Dr.|Way.|Pl.|Ln.|Ct.|Beach|Way ).+(?=...-...-....))
""", expand=True)
Can you help?
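The lookbehind in your pattern lists alternatives of different lengths, which Python's re module rejects ("look-behind requires fixed-width pattern"), and the triple-quoted string also puts literal newlines into the pattern. You may use a capturing group instead: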
rx = r'(?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk)\s*(.+?)\s*\d{3}-\d{3}-\d{4}'
zagat['city'] = zagat['raw'].str.extract(rx, expand=False)
Details
(?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk) - Ave, Rd, St, Blvd, Dr, Way, Pl, Ln or Ct followed by a dot, or Beach, Way or Walk
\s* - 0+ whitespaces
(.+?) - Group 1 (this value will be returned by .extract): any one or more chars other than line break chars, as few as possible
\s* - 0+ whitespaces
\d{3}-\d{3}-\d{4} - 3 digits, -, 3 digits, - and 4 digits.
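To check the pattern against the five sample strings from the question, here is a minimal sketch. It uses re.search from the standard library rather than pandas, on the assumption that .str.extract returns the groups of the first match in the same way. The names samples and extract_city are only illustrative and are not part of the original answer.

```python
import re

# Pattern copied from the answer above.
rx = r'(?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk)\s*(.+?)\s*\d{3}-\d{3}-\d{4}'

# The five sample rows from the question.
samples = [
    "Chinois on Main 2709 Main St. Santa Monica 310-392-9025 Pacific New Wave",
    "Benita's Frites 1433 Third St. Promenade Santa Monica 310-458-2889 Fast Food",
    "Indo Cafe 10428 1/2 National Blvd. LA 310-815-1290 Indonesian",
    "Diaghilev 1020 N. San Vicente Blvd. W. Hollywood 310-854-1111 Russian",
    "Jody Maroni's Sausage Kingdom 2011 Ocean Front Walk Venice 310-306-1995 Hot Dogs",
]

def extract_city(raw):
    """Return Group 1 of the first match, or None if nothing matches."""
    m = re.search(rx, raw)
    return m.group(1) if m else None

cities = [extract_city(s) for s in samples]
for s, c in zip(samples, cities):
    print(f"{c!r:28} <- {s}")
```

Four rows come back as Santa Monica, LA, W. Hollywood and Venice. The Benita's Frites row comes back as "Promenade Santa Monica", because the pattern captures everything between the first street suffix (St.) and the phone number. Addresses that continue after the suffix would need extra handling, for example a lookup against a list of known city names.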