
How to parse an address string to street and house number

I want to separate the street and the house number in an address line. I can split the address on the last space (my code below), but this does not work for the case in line 3, where the house number itself contains a space.

address             street          house_number
my street 6         my street       6
my street 10a       my street       10a
next street 5 c     next street     5 c
next street100      next street     100

My best try, which does not help with the 3rd case:

df['street'] = df['address'].apply(lambda x: ' '.join(x.split(' ')[:-1]))
df['house_number'] = df['address'].apply(lambda x: x.split(' ')[-1])

My idea is to find the first digit in the string and split the string into two parts at that point. Perhaps with a regex? I tried but could not find a solution.

Code for reproduction

import pandas as pd

data = {'address': ['my street 6', 'my street 10a', 'next street 5 c', 'next street100'],
        'street': ['my street', 'my street', 'next street', 'next street'],
        'house_number': ['6', '10a', '5 c', '100']
        }
df = pd.DataFrame(data)

EDITED: 4th case added

I think this will do: use .str.split() to split on the whitespace that comes before a digit.

Data

df=pd.DataFrame({'address':['my street 6','my street 10a','next street 5 c']})

Solution

df.address.str.split(r'\s(?=\d)', expand=True).rename(columns={0:'street',1:'house_number'})

Outcome

      street        house_number
0    my street            6
1    my street          10a
2  next street          5 c

If you want to keep the original column as well, try:

df1=df.join(df.address.str.split(r'\s(?=\d)', expand=True).rename(columns={0:'street',1:'house_number'}))



        address       street     house_number
0      my street 6    my street            6
1    my street 10a    my street          10a
2  next street 5 c  next street          5 c

Regex explanation

The regex matches a whitespace character (\s) only when a digit (\d) follows it. The lookahead (?=\d) checks for that digit without consuming it, so the digit stays in the house number.
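To see the lookahead on its own, here is a small sketch with the standard re module instead of pandas, using the four addresses from the question. The names PATTERN, addresses and parts are just for illustration. It also shows why the 4th case is not handled: there is no whitespace before the digit, so nothing is split.

```python
import re

# Split on a whitespace character only when a digit follows it.
# The lookahead (?=\d) does not consume the digit.
PATTERN = re.compile(r'\s(?=\d)')

addresses = ['my street 6', 'my street 10a', 'next street 5 c', 'next street100']

# One list of pieces per address.
parts = [PATTERN.split(a) for a in addresses]

for address, pieces in zip(addresses, parts):
    print(address, '->', pieces)
```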

For the 4th case in my question, this is the solution I came up with:

df['street'] = df.address.str.split(r'\d', n=1, expand=True)[0].str.rstrip()
df['house_number'] = df.address.str.split(r'.(?=\d)', n=1, expand=True)[1]

So the logic for the street is simply everything before the first digit in the string, with the trailing space stripped (without .str.rstrip() the first three cases would end in a space). For the house number, I split at the character to the left of the first digit and limit the split to two parts (parts 0 and 1, which is why n=1 rather than 2).
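As an alternative to two separate splits, the same "everything before the first digit / everything from the first digit" idea can be sketched as one regex with two capture groups. This is only a sketch under my own assumptions: split_address and ADDRESS_RE are hypothetical names, and returning an empty house number when the address has no digit is my choice, not something from the question.

```python
import re

# Group 1: the shortest run of non-digits (the street).
# \s* swallows any whitespace between street and number.
# Group 2: everything from the first digit to the end (the house number).
ADDRESS_RE = re.compile(r'^(\D*?)\s*(\d.*)$')

def split_address(address):
    """Return (street, house_number) for one address string."""
    match = ADDRESS_RE.match(address)
    if match is None:
        # Assumption: no digit means no house number.
        return address.strip(), ''
    street, house_number = match.groups()
    return street, house_number.strip()

for a in ['my street 6', 'my street 10a', 'next street 5 c', 'next street100']:
    print(a, '->', split_address(a))
```

Since the question already uses .apply, the function could be used as df['address'].apply(split_address), which yields one (street, house_number) tuple per row.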
