
Insert space after the second or third capital letter python

I have a pandas dataframe containing addresses. Some are formatted correctly, like 481 Rogers Rd York ON. Others are missing a space between the city quadrant and the city name, for example 101 9 Ave SWCalgary AB, or possibly even 101 9 Ave SCalgary AB, where SW refers to south west and S to south.

I'm trying to find a regex that will add a space between the second and third capital letters if they are followed by lowercase letters or, if there are only two capitals followed by a lowercase letter, add a space between the first and second.

So far, I've found that ([A-Z]{2,3}[a-z]) will match the situation correctly, but I can't figure out how to look back into it and substitute at position 2 or 3. Ideally, I'd like to use an index to split the match at [-2:], but I can't figure out how to do this.

I found that re.findall('(?<=[A-Z][A-Z])[A-Z][a-z].+', '101 9 Ave SWCalgary AB') will return the last part of the string, and I could use a lookahead regex to find the start and then join the two parts, but this seems very inefficient.
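For reference, the index-based split described above (cutting the match at [-2:]) can be sketched by passing a function as the replacement to re.sub. This is only an illustration of that idea, not a recommended solution; the names pattern, split_before_last_two and fix_address are hypothetical.

```python
import re

# Two or three capitals followed by a lowercase letter (the pattern from the question).
pattern = re.compile(r'[A-Z]{2,3}[a-z]')

def split_before_last_two(match):
    # The last two characters of the match are the start of the city name
    # (one capital plus one lowercase letter); insert a space in front of them.
    text = match.group()
    return text[:-2] + ' ' + text[-2:]

def fix_address(address):  # hypothetical helper name
    return pattern.sub(split_before_last_two, address)

print(fix_address('101 9 Ave SWCalgary AB'))
print(fix_address('101 9 Ave SCalgary AB'))
print(fix_address('481 Rogers Rd York ON'))
```

re.sub calls the function once per match and uses its return value as the replacement, so no separate findall and join step is needed.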


You can use

([A-Z]{1,2})(?=[A-Z][a-z])

to capture the first (or first and second) capital letters, followed by a lookahead for a capital letter and then a lowercase letter. Then replace the match with the first group and a space:

re.sub(r'([A-Z]{1,2})(?=[A-Z][a-z])', r'\1 ', str)
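As a quick check, the substitution above can be run against the three sample addresses from the question. This is a minimal sketch using only the standard re module; the names LOOKAHEAD_PATTERN and add_missing_space are made up for the example.

```python
import re

# One or two capitals, only when a capital + lowercase pair follows.
# The lookahead does not consume the city name, so only group 1 is replaced.
LOOKAHEAD_PATTERN = re.compile(r'([A-Z]{1,2})(?=[A-Z][a-z])')

def add_missing_space(address):  # hypothetical helper name
    return LOOKAHEAD_PATTERN.sub(r'\1 ', address)

for sample in ['481 Rogers Rd York ON',
               '101 9 Ave SWCalgary AB',
               '101 9 Ave SCalgary AB']:
    print(add_missing_space(sample))
```

For SCalgary the engine first tries to take two capitals (SC), finds that "al" does not satisfy the lookahead, and backtracks to a single S, which is why both the one-letter and two-letter cases work with the same pattern.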


You may use

df['Test'] = df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)

See this regex demo


  • \b - a word boundary
  • ([A-Z]{1,2}) - Capturing group 1 (later referred to with \1 in the replacement pattern): one or two uppercase letters
  • ([A-Z][a-z]) - Capturing group 2 (later referred to with \2 in the replacement pattern): an uppercase letter followed by a lowercase one.

If you want to match city quadrants specifically, you may use a slightly more specific regex:

df['Test'] = df['Test'].str.replace(r'\b([NS][EW]|[NESW])([A-Z][a-z])', r'\1 \2', regex=True)

See this regex demo. Here, [NS][EW]|[NESW] matches N or S followed by E or W, or a single N, E, S or W.
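To illustrate the difference between the two patterns, here is a small sketch using plain re.sub rather than pandas. The address '5 Elm St ABToronto ON' is invented for the comparison, and the names GENERAL, QUADRANT and fix are assumptions of this example.

```python
import re

# General pattern: any one or two capitals at a word boundary.
GENERAL = re.compile(r'\b([A-Z]{1,2})([A-Z][a-z])')
# Quadrant pattern: only N, E, S, W or NE, NW, SE, SW at a word boundary.
QUADRANT = re.compile(r'\b([NS][EW]|[NESW])([A-Z][a-z])')

def fix(pattern, address):  # hypothetical helper name
    return pattern.sub(r'\1 \2', address)

print(fix(QUADRANT, '101 9 Ave SWCalgary AB'))  # quadrant prefix is split off
print(fix(QUADRANT, '101 9 Ave SCalgary AB'))   # single-letter quadrant
print(fix(GENERAL, '5 Elm St ABToronto ON'))    # general pattern splits "AB"
print(fix(QUADRANT, '5 Elm St ABToronto ON'))   # "AB" is not a quadrant: unchanged
```

The quadrant version is the safer choice when the data may contain other runs of capitals that should stay attached to the following word.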

Pandas demo:

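>>> import pandas as pd
>>> df = pd.DataFrame({'Test': ['481 Rogers Rd York ON',
...                             '101 9 Ave SWCalgary AB',
...                             '101 9 Ave SCalgary AB']})
>>> # regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
>>> df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)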
0      481 Rogers Rd York ON
1    101 9 Ave SW Calgary AB
2     101 9 Ave S Calgary AB
Name: Test, dtype: object

