
Splitting a string on a special character in a pandas dataframe column based on a conditional

I am trying to standardize an address column in my pandas dataframe. I have a zip code column that comes in two formats: 1) 87301 and 2) 87301-1234. Not every row has the hyphen, so I need to split on the hyphen only when it is present.

My data looks like this:

State  ZIP
CA     85145-7045
PA     76913   

I have tried a few methods of tackling this problem. I have tried:

data['Zip_1'],data['Zip_2'] = data['Zip'].str.split('-').str

I have tried:

data['Zip'] = data['Zip'].str.split('-', n=1, expand=True)
data['Zip'] = data['Zip'][0]
data['Zip_drop'] = data['Zip'][1]

I have also tried using a lambda function.

However, every attempt just returns nulls.

I would expect the new column to hold NaN for zip codes that have no hyphen, and the digits after the hyphen for zip codes that do. Instead, the new column is NaN for every observation.

You can do that by using "replace" combined with regular expressions.

Step 1

import pandas as pd

example_df = pd.DataFrame({'State': ['CA', 'PA'],
                           'ZIP': ['85145-7045', '76913']})

example_df


Step 2

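# Keep only the digits before the hyphen (if any).
# A raw string avoids the invalid '\-' escape, and limiting the
# replacement to the ZIP column leaves other columns untouched.
example_df['ZIP'] = example_df['ZIP'].replace(r'-\d*', '', regex=True)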
example_df

Output:

State  ZIP
CA     85145
PA     76913
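The replace above throws away the part after the hyphen. If you want to keep it in a second column, with NaN where there is no hyphen (as the question asks), one possible sketch uses `str.extract` instead of `replace`. This is a different technique from the one above; the column names `Zip_1` and `Zip_2` are borrowed from the question, and the pattern assumes US-style ZIP codes (five digits, optionally followed by a hyphen and four digits).

```python
import pandas as pd

example_df = pd.DataFrame({'State': ['CA', 'PA'],
                           'ZIP': ['85145-7045', '76913   ']})

# Strip stray whitespace, then capture the 5-digit part and,
# if present, the 4 digits after the hyphen.
parts = example_df['ZIP'].str.strip().str.extract(r'^(\d{5})(?:-(\d{4}))?$')

example_df['Zip_1'] = parts[0]  # always the first five digits
example_df['Zip_2'] = parts[1]  # NaN for rows without a hyphen
```

Because the second group is optional, rows without a hyphen still match and simply get a missing value in `Zip_2`. Rows that do not look like a ZIP code at all end up with NaN in both columns.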

Flag every zip code that contains a hyphen and store the result in a new column:

# str.contains returns True/False; str.find would return a position (-1 when absent)
data['Zip Hyphen'] = data['Zip'].str.contains('-', regex=False)

Then drop every row whose Zip contains a hyphen:

data = data.drop(data[data['Zip Hyphen']].index)

EDIT: This code is not tested but the general idea is there
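The flag-and-filter idea can be sketched with a boolean mask. Note that `str.contains` gives True/False per row, whereas `str.find` gives the position of the hyphen (or -1 when it is absent), so only the former works directly as a filter. The sample frame and the names `has_hyphen`, `with_hyphen` and `without_hyphen` are made up for illustration.

```python
import pandas as pd

data = pd.DataFrame({'State': ['CA', 'PA'],
                     'Zip': ['85145-7045', '76913']})

# True where the zip code contains a literal hyphen.
has_hyphen = data['Zip'].str.contains('-', regex=False)

# Split the frame into two views instead of dropping rows in place.
with_hyphen = data[has_hyphen]
without_hyphen = data[~has_hyphen]
```

Filtering with the mask keeps the original `data` intact, which may be preferable to `drop` if the hyphenated rows are still needed later.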
