Insert a space after the second or third capital letter in Python
I have a pandas dataframe containing addresses. Some are formatted correctly, like
481 Rogers Rd York ON
Others are missing a space between the city quadrant and the city name, for example
101 9 Ave SWCalgary AB
or possibly even
101 9 Ave SCalgary AB
where SW refers to south west and S to south.
I'm trying to find a regex that adds a space between the second and third capital letters when they are followed by lowercase letters, or, when only two capitals are followed by lowercase letters, between the first and second.
So far, I've found that
([A-Z]{2,3}[a-z])
matches these cases correctly, but I can't figure out how to look back into the match and substitute at position 2 or 3. Ideally, I'd like to use an index to split the match at [-2:], but I can't figure out how to do this.
I found that
re.findall('(?<=[A-Z][A-Z])[A-Z][a-z].+', '101 9 Ave SWCalgary AB')
returns the last part of the string, and I could use a lookahead regex to find the start and then join the two, but this seems very inefficient.
Thanks
You can use
([A-Z]{1,2})(?=[A-Z][a-z])
to capture the first (or first and second) capital letters, with a lookahead requiring a capital letter followed by a lowercase letter. Then replace the match with the first group and a space (here address is the input string):
re.sub(r'([A-Z]{1,2})(?=[A-Z][a-z])', r'\1 ', address)
https://regex101.com/r/TcB4Ph/1
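As a quick check of the lookahead approach, here is a minimal sketch that runs the same pattern over the three sample addresses from the question. The names QUADRANT_GAP and add_missing_space are made up for this illustration.

```python
import re

# One or two capitals, only when a capital + lowercase letter follows.
# The lookahead does not consume the city name, so only group 1 is replaced.
QUADRANT_GAP = re.compile(r'([A-Z]{1,2})(?=[A-Z][a-z])')


def add_missing_space(address):
    """Insert a space between a quadrant (S, SW, ...) and a fused city name."""
    return QUADRANT_GAP.sub(r'\1 ', address)


addresses = [
    '481 Rogers Rd York ON',
    '101 9 Ave SWCalgary AB',
    '101 9 Ave SCalgary AB',
]

for address in addresses:
    print(add_missing_space(address))
```

Because {1,2} is greedy, the regex first tries two capitals and falls back to one when the lookahead fails, which is how SCalgary ends up split after the S.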
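You may use
df['Test'] = df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)
(regex=True is needed on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default.)
See this regex demo.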
Details
\b - a word boundary
([A-Z]{1,2}) - Capturing group 1 (later referred to with \1 from the replacement pattern): one or two uppercase letters
([A-Z][a-z]) - Capturing group 2 (later referred to with \2 from the replacement pattern): an uppercase letter followed by a lowercase one.
If you want to specifically match city quadrants, you may use a more specific regex:
df['Test'] = df['Test'].str.replace(r'\b([NS][EW]|[NESW])([A-Z][a-z])', r'\1 \2', regex=True)
See this regex demo. Here,
[NS][EW]|[NESW]
matches N or S followed by E or W, or a single N, E, S or W.
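To illustrate how the two patterns differ, the sketch below applies both with the standard re module. The third sample address ('12 JFKennedy Blvd') is invented for this comparison and is not from the question. It shows a case where the generic pattern splits a prefix that is not a quadrant.

```python
import re

# Generic: any one or two capitals before a capitalised word.
GENERIC = re.compile(r'\b([A-Z]{1,2})([A-Z][a-z])')
# Specific: only compass quadrants (NE, NW, SE, SW, N, E, S, W).
QUADRANT = re.compile(r'\b([NS][EW]|[NESW])([A-Z][a-z])')

samples = [
    '101 9 Ave SWCalgary AB',
    '101 9 Ave SCalgary AB',
    '12 JFKennedy Blvd',  # hypothetical input, not from the question
]

for text in samples:
    generic_result = GENERIC.sub(r'\1 \2', text)
    quadrant_result = QUADRANT.sub(r'\1 \2', text)
    print(f'{text!r}: generic={generic_result!r}, quadrant={quadrant_result!r}')
```

Both patterns fix the two Calgary addresses. Only the generic one rewrites the invented third sample, so the quadrant-specific pattern is the safer choice if the data may contain other fused capitals that should be left alone.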
Pandas demo:
import pandas as pd
df = pd.DataFrame({'Test':['481 Rogers Rd York ON',
                           '101 9 Ave SWCalgary AB',
                           '101 9 Ave SCalgary AB']})
>>> df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)
0 481 Rogers Rd York ON
1 101 9 Ave SW Calgary AB
2 101 9 Ave S Calgary AB
Name: Test, dtype: object
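The question also asked about splitting the match by index at [-2:]. One possible way to do that, sketched here with the standard re module only, is to pass a function as the replacement argument of re.sub. The pattern is the one from the question. The helper names split_before_last_two and fix_address are made up for this example.

```python
import re

# Pattern from the question: two or three capitals followed by a lowercase letter.
FUSED = re.compile(r'[A-Z]{2,3}[a-z]')


def split_before_last_two(match):
    """Put a space before the last two characters of the match (e.g. 'Ca')."""
    text = match.group(0)
    return text[:-2] + ' ' + text[-2:]


def fix_address(address):
    # re.sub calls the function once per match and uses its return value.
    return FUSED.sub(split_before_last_two, address)


print(fix_address('101 9 Ave SWCalgary AB'))
print(fix_address('101 9 Ave SCalgary AB'))
print(fix_address('481 Rogers Rd York ON'))
```

The capture-group replacements shown above are shorter. A replacement function is mainly worth it when the split position needs logic that a plain replacement string cannot express.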