How to parse an address string to street and house number

I want to separate the street and the house number in an address line. I can split the address on the last space (my code is below), but this fails for the third row, where the house number itself contains a space.

address             street          house_number
my street 6         my street       6
my street 10a       my street       10a
next street 5 c     next street     5 c
next street100      next street     100

My best try, which does not handle the 3rd case:

df['street'] = df['address'].apply(lambda x: ' '.join(x.split(' ')[:-1]))
df['house_number'] = df['address'].apply(lambda x: x.split(' ')[-1])

My idea would be to find the first digit in the string and split the string into two parts there. Perhaps with a regex? I tried but did not find a solution.
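The idea of splitting at the first digit can be sketched in plain Python as follows. This is only an illustration under the assumption that the house number always starts at the first digit; the function name `split_at_first_digit` is hypothetical and not part of the original post.

```python
import re

def split_at_first_digit(address):
    """Split an address into (street, house_number) at the first digit.

    Hypothetical helper: assumes the house number starts at the first digit.
    """
    match = re.search(r'\d', address)
    if match is None:
        # No digit found: treat the whole string as the street.
        return address.strip(), ''
    pos = match.start()
    # Everything before the first digit is the street, the rest is the number.
    return address[:pos].strip(), address[pos:].strip()

addresses = ['my street 6', 'my street 10a', 'next street 5 c', 'next street100']
results = [split_at_first_digit(a) for a in addresses]
```

With pandas, a function like this could be applied row by row through `df['address'].apply(...)`, although the vectorized string methods are usually more convenient.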

Code for reproduction

import pandas as pd

data = {'address': ['my street 6', 'my street 10a', 'next street 5 c', 'next street100'],
        'street': ['my street', 'my street', 'next street', 'next street'],
        'house_number': ['6', '10a', '5 c', '100']
        }
df = pd.DataFrame(data)

EDITED: 4th case added

I think this will do: use .str.split() to split on the space that comes before the digit.

Data

df=pd.DataFrame({'address':['my street 6','my street 10a','next street 5 c']})

Solution

df.address.str.split(r'\s(?=\d)', expand=True).rename(columns={0:'street',1:'house_number'})

Outcome

      street        house_number
0    my street            6
1    my street          10a
2  next street          5 c

If you want to keep the original column, try:

df1=df.join(df.address.str.split(r'\s(?=\d)', expand=True).rename(columns={0:'street',1:'house_number'}))



        address       street     house_number
0      my street 6    my street            6
1    my street 10a    my street          10a
2  next street 5 c  next street          5 c

RegEx explanation

The RegEx matches a whitespace character (\s) only if a digit (\d) follows it. The lookahead (?=\d) checks for the digit without consuming it, so the digit stays in the house number.
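To illustrate the lookahead, here is a small sketch using the standard `re` module rather than pandas. The variable names are illustrative only.

```python
import re

# Without the lookahead, every whitespace character is a split point.
plain = re.split(r'\s', 'next street 5 c')

# With the lookahead, only a space directly followed by a digit qualifies,
# and the digit itself is not consumed by the match.
lookahead = re.split(r'\s(?=\d)', 'next street 5 c')

# If there is no space before the digit, nothing matches and the
# string is returned unsplit.
unsplit = re.split(r'\s(?=\d)', 'next street100')
```

The last call shows why this pattern alone does not cover an address such as 'next street100', where no space precedes the number.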

For the 4th case in my question, this is the solution I came up with:

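# Street: everything before the first digit; strip the trailing space left behind.
df['street'] = df.address.str.split(r'\d', expand=True)[0].str.strip()
# House number: split once on the character just before the first digit.
df['house_number'] = df.address.str.split(r'.(?=\d)', n=1, expand=True)[1]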

So the street is simply everything before the first digit in the string. For the house number, I split on the character immediately to the left of the first digit and limit the number of splits with n=1, which yields two parts (part 0 and part 1).
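As a possible alternative, the same logic can be written as a single pass with `.str.extract()` and named groups. This is a sketch, not the approach from the answers above; it assumes every address contains at least one digit, and rows that do not match would come back as NaN.

```python
import pandas as pd

df = pd.DataFrame({'address': ['my street 6', 'my street 10a',
                               'next street 5 c', 'next street100']})

# street: shortest run of non-digits from the start of the string
# \s*: optional whitespace between street and number (not captured)
# house_number: everything from the first digit to the end
pattern = r'^(?P<street>\D*?)\s*(?P<house_number>\d.*)$'

# Named groups become the column names of the extracted frame.
df = df.join(df['address'].str.extract(pattern))
```

Because the whitespace between the two groups is matched outside of them, no separate strip step is needed here.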
