
Remove partial string from dataframe with Pandas

If I have a dataframe like this:

id    str
01    abc_d(a)
02    ab_d(a)
03    abcd_e(a)
04    a_b(a)

How can I get a dataframe like the following? Sorry, I made up this dataframe to represent my real issue. Thanks.

id    str
01    d
02    d
03    e
04    b
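
To try the answers below, the sample frame from the question can be rebuilt roughly like this. It is only a sketch: the question does not say how `id` is stored, so keeping it as strings (to preserve the leading zero) is an assumption.

```python
import pandas as pd

# Sample data copied from the question.
# Assumption: ids are stored as strings so the leading zero survives.
df = pd.DataFrame({
    "id": ["01", "02", "03", "04"],
    "str": ["abc_d(a)", "ab_d(a)", "abcd_e(a)", "a_b(a)"],
})
print(df)
```

The outputs in the answers show `id` as 1, 2, 3, 4, which suggests the answerers loaded that column as integers instead.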

Using extract

df['str'] = df['str'].str.extract(r"_(.*)\(", expand=True)
df
Out[585]: 
   id str
0   1   d
1   2   d
2   3   e
3   4   b

(Bad Answer)

Series.str.split soup

df['str'] = df['str'].str.split('(').str[0].str.split('_').str[-1]    
df

   id str
0   1   d
1   2   d
2   3   e
3   4   b

(Less Bad answer)

Series.str.extract

df['str'] = df['str'].str.extract(r'_([^_]+)\(', expand=False)
df

   id str
0   1   d
1   2   d
2   3   e
3   4   b

Regex methods come with their fair share of overhead, and str.extract does not do much to make things better.
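
Apart from speed, the pattern here is also stricter than the greedy `_(.*)\(` used in the first answer. A small sketch with plain `re` shows where the two diverge; the sample value is made up and does not come from the question.

```python
import re

greedy = re.compile(r'_(.*)\(')      # everything from the first '_' to the last '('
narrow = re.compile(r'_([^_]+)\(')   # only the underscore-free run right before '('

sample = "a_b_c(a)"  # hypothetical value with two underscores

greedy_hit = greedy.search(sample).group(1)
narrow_hit = narrow.search(sample).group(1)
print(greedy_hit)  # b_c
print(narrow_hit)  # c
```

On the four values in the question both patterns give the same result, so the difference only matters if the real data can contain more than one underscore.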


(Better Answer)

re.search with list comp

import re

p = re.compile(r'(?<=_)[^_]+(?=\()')
df['str'] = [p.search(x)[0] for x in df['str'].tolist()] 
df

   id str
0   1   d
1   2   d
2   3   e
3   4   b

This should be faster than the above methods. I find list comprehensions are really fast compared to most of pandas' vectorised string methods, even though this one still uses regex. I pre-compile the pattern in advance to alleviate some of the performance concerns.
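
One caveat with the list comprehension: `p.search(x)` returns `None` when a value does not match, and indexing `None` raises a `TypeError`. If the real data may contain such rows, a guarded variant along these lines could help. The helper name `extract_token` and the non-matching sample values are made up for illustration.

```python
import re

p = re.compile(r'(?<=_)[^_]+(?=\()')

def extract_token(value, default=None):
    """Hypothetical helper: text between '_' and '(' or `default` if absent."""
    m = p.search(value)
    return m[0] if m else default

# The last two values are invented to show the non-matching case.
values = ["abc_d(a)", "ab_d(a)", "no_match_here", "plain"]
result = [extract_token(v) for v in values]
print(result)  # ['d', 'd', None, None]
```

With a dataframe this would be used as `df['str'] = [extract_token(x) for x in df['str'].tolist()]`, at the cost of one extra function call per row.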


(Also a better answer)

str.split with list comp

df['str'] = [
    x.split('(', 1)[0].split('_')[1] for x in df['str'].tolist()
]
df

   id str
0   1   d
1   2   d
2   3   e
3   4   b

This combines the best of both worlds: the low overhead of a list comp and the speed of pure Python string splitting. It should be the fastest.


Performance

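# Run in IPython/Jupyter (needed for %timeit); p is the pattern compiled above.
import pandas as pd

df_test = pd.concat([df] * 10000, ignore_index=True)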

%timeit df_test['str'].str.extract(r'_([^_]+)\(', expand=False)
%timeit df_test['str'].str.split('(').str[0].str.split('_').str[-1] 
%timeit [p.search(x)[0] for x in df_test['str'].tolist()] 
%timeit [x.split('(', 1)[0].split('_')[1] for x in df_test['str'].tolist()]

70.4 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
99.6 ms ± 730 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
31 ms ± 877 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
30 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # fastest but not by much
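
Outside IPython, where `%timeit` is not available, a comparable measurement can be sketched with the standard `timeit` module. The repeat counts and the dictionary of candidates are arbitrary choices for this sketch, and the absolute numbers will differ by machine and pandas version.

```python
import re
import timeit

import pandas as pd

df = pd.DataFrame({"str": ["abc_d(a)", "ab_d(a)", "abcd_e(a)", "a_b(a)"]})
df_test = pd.concat([df] * 10000, ignore_index=True)
p = re.compile(r'(?<=_)[^_]+(?=\()')

# The four approaches from this answer, wrapped as zero-argument callables.
candidates = {
    "str.extract": lambda: df_test['str'].str.extract(r'_([^_]+)\(', expand=False),
    "str.split chain": lambda: df_test['str'].str.split('(').str[0].str.split('_').str[-1],
    "re.search list comp": lambda: [p.search(x)[0] for x in df_test['str'].tolist()],
    "str.split list comp": lambda: [x.split('(', 1)[0].split('_')[1] for x in df_test['str'].tolist()],
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats, 3 calls per repeat, reported in ms per call.
    best = min(timeit.repeat(func, number=3, repeat=3)) / 3
    timings[name] = best * 1000
    print(f"{name:22s} {timings[name]:8.2f} ms per call")
```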

Maybe you can try split, as in this example:

df['str'] = df['str'].str.split('_').str.get(1).str[0]

Or,

df['str'] = df['str'].str.split('_').str.get(1).str.split('(').str[0]

Using pd.Series.str.split. This is specific to your particular format.

df['str'] = df['str'].str.split('_').str[-1].str[0]

print(df)

   id str
0   1   d
1   2   d
2   3   e
3   4   b
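
"Specific to your particular format" matters here: the trailing `.str[0]` keeps only the first character after the underscore. A short sketch of the difference, where the second sample value is made up to have a two-character token:

```python
import pandas as pd

# Second value is hypothetical: the token between '_' and '(' has two characters.
s = pd.Series(["abc_d(a)", "abc_de(a)"])

first_char = s.str.split('_').str[-1].str[0]                 # first character only
full_token = s.str.split('_').str[-1].str.split('(').str[0]  # everything before '('

print(first_char.tolist())  # ['d', 'd']
print(full_token.tolist())  # ['d', 'de']
```

If the real tokens can be longer than one character, splitting on `(` as in the second line is the safer choice.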
