简体   繁体   中英

Select a partial string from a list within a column in Pandas DataFrame

I have some DataFrame:

d = {'fruit': ['apple', 'pear', 'peach'], 'values': ['apple_1_0,peach_1_5','pear_1_3','mango_1_0,banana_1_0,pineapple_1_10']}
df = pd.DataFrame(data=d)
df

fruit   values
0   apple   apple_1_0,peach_1_5
1   pear    pear_1_3
2   peach   mango_1_0,banana_1_0,pineapple_1_10

The strings in the values column are comma separated, and I'd like the strings that contain the substring '_1_0'.

Desired output :

    fruit   values
0   apple   apple_1_0
1   pear    NaN
2   peach   mango_1_0,banana_1_0

Something like this is somewhat close to what I'm trying to do but is painfully slow over ~100,000 rows:

for row in range(len(df)):
    print([zero for zero in df['values'].str.split(',', expand=False)[row] if "_1_0" in zero])

['apple_1_0']
[]
['mango_1_0', 'banana_1_0']

Let us try explode

s = df['values'].str.split(',').explode()
df['New_values'] = s.where(s.str.endswith('_1_0')).dropna().groupby(level=0).agg(','.join)
df
Out[29]: 
   fruit                               values            New_values
0  apple                  apple_1_0,peach_1_5             apple_1_0
1   pear                             pear_1_3                   NaN
2  peach  mango_1_0,banana_1_0,pineapple_1_10  mango_1_0,banana_1_0

Straightforward solution:

import numpy as np
import pandas as pd

d = {'fruit': ['apple', 'pear', 'peach'],
     'values': ['apple_1_0,peach_1_5', 'pear_1_3', 'mango_1_0,banana_1_0,pineapple_1_10']}
df = pd.DataFrame(data=d)

new_data = df['values'].str.split(',')
new_data = new_data.apply(lambda lst: [elem for elem in lst if '_1_0' in elem])
new_data = new_data.str.join(",")
new_data = new_data.replace('', np.NaN)

This is an alternative, as a list comprehension :

    df["values"] = [ ",".join(entry if entry.endswith("1_0") 
                              else "" 
                              for entry in val.split(","))
                       .rstrip(",")
                   for val in df["values"]
                   ]

     df = df.replace({"": np.nan})

    df


   fruit    values
0   apple   apple_1_0
1   pear    NaN
2   peach   mango_1_0,banana_1_0

Using findall you may do this:

import numpy as np
import pandas as pd

d = {'fruit': ['apple', 'pear', 'peach'], 'values': ['apple_1_0,peach_1_5','pear_1_3','mango_1_0,banana_1_0,pineapple_1_10']}
df = pd.DataFrame(data=d)

df['values'] = df['values'].str.findall(r'[^,]*_1_0(?=,|$)').apply(','.join).replace('', np.NaN)    
print ( df )
   fruit                values
0  apple             apple_1_0
1   pear                   NaN
2  peach  mango_1_0,banana_1_0

Regex [^,]*_1_0(?=,|$) matches a non-comma string that ends in _1_0 followed by comma or end of string.


We can use a lambda as well:

df['values'] = df['values'].str.findall(r'[^,]*_1_0(?=,|$)').apply(lambda items: ','.join(items) if len(items) > 0 else np.NaN)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM