
Insert a space after the second or third capital letter in Python

I have a pandas dataframe containing addresses. Some are formatted correctly, like 481 Rogers Rd York ON. Others are missing a space between the city quadrant and the city name, for example 101 9 Ave SWCalgary AB or possibly even 101 9 Ave SCalgary AB, where SW refers to south west and S to south.

I'm trying to find a regex that will add a space between the second and third capital letters if they are followed by lowercase letters, or, if there are only two capitals followed by lowercase letters, add a space between the first and second.

So far, I've found that ([A-Z]{2,3}[a-z]) will match the situation correctly, but I can't figure out how to look back into the match and substitute at position 2 or 3. Ideally, I'd like to use an index to split the match at [-2:], but I can't figure out how to do this.

I found that re.findall('(?<=[A-Z][A-Z])[A-Z][a-z].+', '101 9 Ave SWCalgary AB') will return the last part of the string, and I could use a lookahead regex to find the start and then join the two parts, but this seems very inefficient.

Thanks

You can use

([A-Z]{1,2})(?=[A-Z][a-z])

to capture the first (or first and second) capital letters, and then use lookahead for a capital letter followed by a lowercase letter. Then, replace with the first group and a space:

re.sub(r'([A-Z]{1,2})(?=[A-Z][a-z])', r'\1 ', text)

https://regex101.com/r/TcB4Ph/1
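
As a quick check of the lookahead pattern above, here is a minimal sketch that runs it over the three sample addresses from the question. The helper name add_missing_space is made up for illustration and is not part of the original answer.

```python
import re

# One or two capitals that are immediately followed by a capital + lowercase pair.
# The lookahead does not consume the city name, so only the prefix is replaced.
PATTERN = re.compile(r'([A-Z]{1,2})(?=[A-Z][a-z])')

def add_missing_space(address):
    """Hypothetical helper: insert a space after the captured capitals."""
    return PATTERN.sub(r'\1 ', address)

samples = [
    '481 Rogers Rd York ON',
    '101 9 Ave SWCalgary AB',
    '101 9 Ave SCalgary AB',
]
for sample in samples:
    print(add_missing_space(sample))
```

Because the city name is only looked at and never captured, the replacement needs just the one group reference.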

You may use

df['Test'] = df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)

See this regex demo

Details

  • \b - a word boundary
  • ([A-Z]{1,2}) - Capturing group 1 (later referred to with \1 in the replacement pattern): one or two uppercase letters
  • ([A-Z][a-z]) - Capturing group 2 (later referred to with \2 in the replacement pattern): an uppercase letter followed by a lowercase one.
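
To make the two capturing groups concrete, the following sketch inspects a single match on one of the question's sample addresses using only the standard re module. The variable name m is arbitrary.

```python
import re

# Same pattern as in the answer: word boundary, 1-2 capitals, then capital + lowercase.
m = re.search(r'\b([A-Z]{1,2})([A-Z][a-z])', '101 9 Ave SWCalgary AB')

print(m.group(1))          # group 1: the quadrant letters
print(m.group(2))          # group 2: the first two letters of the city name
print(m.expand(r'\1 \2'))  # what the replacement template produces for this match
```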

If you want to match city quadrants specifically, you may use a slightly more specific regex:

df['Test'] = df['Test'].str.replace(r'\b([NS][EW]|[NESW])([A-Z][a-z])', r'\1 \2', regex=True)

See this regex demo. Here, [NS][EW]|[NESW] matches N or S followed by E or W, or a single N, E, S or W.
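
The difference between the general and the quadrant-specific pattern can be seen in a small comparison, sketched here with plain re.sub. The function names and the addresses other than the Calgary one are invented for illustration.

```python
import re

GENERAL = re.compile(r'\b([A-Z]{1,2})([A-Z][a-z])')
QUADRANT = re.compile(r'\b([NS][EW]|[NESW])([A-Z][a-z])')

def split_general(text):
    """Hypothetical helper: split after any one or two leading capitals."""
    return GENERAL.sub(r'\1 \2', text)

def split_quadrant(text):
    """Hypothetical helper: split only after a compass quadrant (N, SW, ...)."""
    return QUADRANT.sub(r'\1 \2', text)

examples = [
    '101 9 Ave SWCalgary AB',   # from the question
    '101 9 Ave SEdmonton AB',   # made up: "SE" is tried first, then the regex backtracks to "S"
    '10 Elm St CNTower ON',     # made up: "CN" is not a quadrant
]
for text in examples:
    print(split_general(text), '|', split_quadrant(text))
```

The stricter pattern leaves non-quadrant prefixes such as CN untouched, which may or may not be what you want depending on the data.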

Pandas demo:

>>> import pandas as pd
>>> df = pd.DataFrame({'Test': ['481 Rogers Rd York ON',
...                             '101 9 Ave SWCalgary AB',
...                             '101 9 Ave SCalgary AB']})
>>> # regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
>>> df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)
0      481 Rogers Rd York ON
1    101 9 Ave SW Calgary AB
2     101 9 Ave S Calgary AB
Name: Test, dtype: object
