How to efficiently remove elements from the lists in a pandas DataFrame column

I have a pandas dataframe of the following structure:

   Col1   |           Col2      |     Col3
   -------+---------------------+--------------
0   6     |    [a,b,c,d,e,f]    |     ....
1   4     |    [a,g,h,i]        |     ....
2   5     |    [a,b,j,k,l]      |     ....

I have a list of elements, [a,b,h], that I need to remove from every list in Col2.

The result should look like this:

   Col1   |           Col2  |     Col3
   -------+-----------------+--------------
0   4     |    [c,d,e,f]    |     ....
1   2     |    [g,i]        |     ....
2   3     |    [j,k,l]      |     ....

where Col1 is the number of elements in the corresponding Col2 list.

I tried:

import numpy as np

def modify_data(dataset):
    ds = dataset.copy()
    remove_list = ['a', 'b', 'h']  # elements to drop from every list
    removed_col2 = []
    counts = []
    for row in dataset['Col2']:
        # set difference drops the unwanted elements (order is not preserved)
        cleaned = np.array(list(set(row) - set(remove_list)))
        removed_col2.append(cleaned)
        counts.append(len(cleaned))

    ds.loc[:, 'Col1'] = counts
    ds.loc[:, 'Col2'] = removed_col2
    return ds

But this is far too slow. For example, on a dataset with 200,000 rows:

CPU times: user 11min 26s, sys: 24.2 s, total: 11min 50s
Wall time: 11min 48s

I would try:

df.Col2 = (df.Col2.map(set)-set(['a','b','h'])).map(list)
df.Col1 = df.Col2.str.len()
df
Out[111]: 
           Col2  Col1
0  [f, e, c, d]     4
1        [g, i]     2
2     [j, k, l]     3
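
Note that a set difference keeps neither the original element order (see the [f, e, c, d] row above) nor duplicates. If a deterministic order matters, one option is to sort each result. This is a minimal sketch, assuming the elements are hashable and mutually comparable; the name `remove` and the sample frame are illustrative only:

```python
import pandas as pd

# Sample frame mirroring the question (illustrative data).
df = pd.DataFrame({
    "Col1": [6, 4, 5],
    "Col2": [list("abcdef"), list("aghi"), list("abjkl")],
})
remove = {"a", "b", "h"}  # hypothetical name for the elements to drop

# Per-row set difference, then sort so the output order is deterministic.
df["Col2"] = df["Col2"].map(lambda row: sorted(set(row) - remove))
# .str.len() also works on list elements and returns each list's length.
df["Col1"] = df["Col2"].str.len()
```

Sorting adds some cost per row, so it is only worth it when the order of the remaining elements matters.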

Another solution, using a list comprehension:

df = pd.DataFrame(
    {
        "col1": [6, 4, 5],
        "col2": [
            ["a", "b", "c", "d", "e", "f"],
            ["a", "g", "h", "i"],
            ["a", "b", "j", "k", "l"],
        ],
    }
)

df['col2'] = [[value for value in entry
               if value not in ('a','b','h')] 
              for entry in df.col2
             ]
df['col1'] = df.col2.str.len()

   col1          col2
0     4  [c, d, e, f]
1     2        [g, i]
2     3     [j, k, l]
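
The list comprehension can also be wrapped in a helper shaped like the asker's modify_data. This is a sketch rather than a benchmarked implementation: `remove_elements` and its parameter names are hypothetical. It converts the removal list to a set so that each membership test is a hash lookup instead of a scan, which may help when the removal list is long.

```python
import pandas as pd

def remove_elements(dataset, to_remove, list_col="Col2", count_col="Col1"):
    """Return a copy with `to_remove` dropped from every list in `list_col`."""
    blocked = set(to_remove)          # hash-based membership tests
    out = dataset.copy()              # leave the caller's frame untouched
    cleaned = [[v for v in row if v not in blocked] for row in out[list_col]]
    out[list_col] = pd.Series(cleaned, index=out.index)
    out[count_col] = [len(row) for row in cleaned]
    return out

# Usage on the data from the question (illustrative).
df = pd.DataFrame({
    "Col1": [6, 4, 5],
    "Col2": [list("abcdef"), list("aghi"), list("abjkl")],
})
result = remove_elements(df, ["a", "b", "h"])
```

Unlike the set-difference approach, this keeps the original order and any duplicate elements in each list.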
