How to parse an address string to street and house number
I want to separate the street and the house number from an address line. I can split the address on the last space (my code below), but that does not help for the case in row 3, where the house number itself contains a space.
address            street         house_number
my street 6        my street      6
my street 10a      my street      10a
next street 5 c    next street    5 c
next street100     next street    100
My best try, which does not help with the 3rd case:
df['street'] = df['address'].apply(lambda x: ' '.join(x.split(' ')[:-1]))
df['house_number'] = df['address'].apply(lambda x: x.split(' ')[-1])
My idea would be to identify the first digit in the string and split the string into two parts from there. Regex? I tried but did not find a solution.
Code for reproduction
import pandas as pd

data = {'address': ['my street 6', 'my street 10a', 'next street 5 c', 'next street100'],
        'street': ['my street', 'my street', 'next street', 'next street'],
        'house_number': ['6', '10a', '5 c', '100']
        }
df = pd.DataFrame(data)
EDITED: 4th case added
I think this will do: use .str.split() to split on the space that comes before a digit.
Data
df=pd.DataFrame({'address':['my street 6','my street 10a','next street 5 c']})
Solution
df.address.str.split(r'\s(?=\d)', expand=True).rename(columns={0:'street',1:'house_number'})
Outcome
        street  house_number
0    my street             6
1    my street           10a
2  next street           5 c
If you want to keep the original column as well, try:
df1=df.join(df.address.str.split(r'\s(?=\d)', expand=True).rename(columns={0:'street',1:'house_number'}))
           address       street  house_number
0      my street 6    my street             6
1    my street 10a    my street           10a
2  next street 5 c  next street           5 c
RegEx explanation
The regex matches a whitespace character (\s), but only where a digit (\d) follows it. The lookahead (?=\d) checks for that digit without consuming it, so the digit stays at the start of the house number.
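The lookahead behaviour can be checked outside pandas with the standard re module. This is only a small sketch using the sample addresses from the question; the variable names are illustrative. It also shows why the 4th case ('next street100') is left unsplit: there is no space directly before a digit.

```python
import re

# Split on a whitespace character only when a digit follows it.
# The lookahead (?=\d) tests for the digit but does not consume it,
# so the digit remains at the start of the second part.
pattern = re.compile(r"\s(?=\d)")

addresses = ["my street 6", "my street 10a", "next street 5 c"]
parts = [pattern.split(a, maxsplit=1) for a in addresses]
print(parts)
# [['my street', '6'], ['my street', '10a'], ['next street', '5 c']]

# No space precedes the digits here, so nothing is split.
unsplit = pattern.split("next street100", maxsplit=1)
print(unsplit)
# ['next street100']
```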
For the 4th case in my question, this is the solution I came up with:
df['street'] = df.address.str.split(r'\d', expand=True)[0].str.strip()  # strip the space left before the digit
df['house_number'] = df.address.str.split(r'.(?=\d)', n=1, expand=True)[1]
So the logic for the street is simply everything before the first digit in the string. For the house number, I split at the character immediately to the left of the first digit and limit the split to two parts (parts 0 and 1, which is why n=1 rather than 2).
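As a possible alternative, both parts can be captured with one pattern instead of two separate splits. The sketch below is not the approach used above; split_address and ADDRESS_RE are hypothetical names, and it assumes the street name itself contains no digits (something like '5th avenue 12' would be split at the wrong place).

```python
import re

# Hypothetical pattern: lazily capture everything before the first digit as
# the street, skip optional whitespace, then take the rest as house number.
ADDRESS_RE = re.compile(r"^(?P<street>\D*?)\s*(?P<house_number>\d.*)$")

def split_address(address):
    """Return (street, house_number); house_number is None if no digit is found."""
    match = ADDRESS_RE.match(address)
    if match is None:
        return address, None
    return match.group("street"), match.group("house_number")

samples = ["my street 6", "my street 10a", "next street 5 c", "next street100"]
results = [split_address(a) for a in samples]
print(results)
# [('my street', '6'), ('my street', '10a'), ('next street', '5 c'), ('next street', '100')]
```

In pandas, the same pattern string should also work with Series.str.extract, where the named groups become the column names of the result.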