简体   繁体   中英

Split string column and extract second part in Python

Let's say I have a dataframe as follows:

df = pd.DataFrame({"id": range(4), "price": ["15dollar/m2/day", "90dollar/m2/month", "18dollar/m2/day", "100dollar/m2/month"]})

       id               price
    0   0     15dollar/m2/day
    1   1   90dollar/m2/month
    2   2     18dollar/m2/day
    3   3  100dollar/m2/month

I would like to split column price into two new columns: unit_price and price_unit as below:

   id     unit_price  price_unit
0   0        15.0    dollar/m2/day
1   1        90.0    dollar/m2/month
2   2        18.0    dollar/m2/day
3   3       100.0    dollar/m2/month

This is my solution:

df['unit_price'] = df['price'].str.split('dollar').str[0].astype(float)
#df['unit_price'] = df['price'].str.extract('(\d*\.\d+|\d+)', expand=False).astype(float)
df['price_unit'] = df['price'].str.split('dollar').str[1]
del df['price']

For the column unit_price , it works fine, but for price_unit , when i split by dollar , I got result as below which doesn't including character dollar , or if I use df['price'].str.replace(r'\\d', '') , all numbers were removed. How could I do it correctly in Python? Thanks.

df['price_unit']
Out[474]: 
0      /m2/day
1    /m2/month
2      /m2/day
3    /m2/month
Name: price_unit, dtype: object 

You can use Series.str.extract with regex - ^ for start of string, \\d*\\.\\d+ for floats or \\d+ for integers and then for all another values by .* :

df = df.join(df.pop('price').str.extract('(?P<unit_price>^\d*\.\d+|^\d+)(?P<price_unit>.*)'))
print (df)
   id unit_price       price_unit
0   0         15    dollar/m2/day
1   1         90  dollar/m2/month
2   2         18    dollar/m2/day
3   3        100  dollar/m2/month

First solution is using extract and replace by numeric:

pat = '(^\d*\.\d+|^\d+)'
df['unit_price'] = df['price'].str.extract(pat, expand=False)
df['price_unit'] = df.pop('price').str.replace(pat,'')
print (df)
   id unit_price       price_unit
0   0         15    dollar/m2/day
1   1         90  dollar/m2/month
2   2         18    dollar/m2/day
3   3        100  dollar/m2/month

If it's just a string, why not keep it simple and add the dollar part yourself?

df['price_unit'] = 'dollar' + df['price'].str.split('dollar').str[1]

import pandas as pd

df = pd.DataFrame({"id": range(4), "price": ["15dollar/m2/day", "90dollar/m2/month", "18dollar/m2/day", "100dollar/m2/month"]})

df['unit_price'] = df['price'].str.split('dollar').str[0].astype(float)
df['price_unit'] = 'dollar' + df['price'].str.split('dollar').str[1]

del df['price']

print(df)

   id  unit_price       price_unit
0   0        15.0    dollar/m2/day
1   1        90.0  dollar/m2/month
2   2        18.0    dollar/m2/day
3   3       100.0  dollar/m2/month

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM