I have a pandas dataframe of the following structure:
   Col1 | Col2          | Col3
--------+---------------+------
0     6 | [a,b,c,d,e,f] | ....
1     4 | [a,g,h,i]     | ....
2     5 | [a,b,j,k,l]   | ....
I have a list of elements, [a,b,h], that I need to remove from every list in Col2.
The result should look like this:
   Col1 | Col2      | Col3
--------+-----------+------
0     4 | [c,d,e,f] | ....
1     2 | [g,i]     | ....
2     3 | [j,k,l]   | ....
where Col1 is the number of elements in Col2.
I tried:
import numpy as np

def modify_data(dataset):
    ds = dataset.copy()
    Col2 = dataset['Col2']
    remove_list = ['a', 'b', 'h']  # elements to drop from every list
    removed_col2 = []
    counts = []
    for i, row in enumerate(Col2):
        # set difference: drops the unwanted elements (order is not preserved)
        cleaned = np.array(list(set(row) - set(remove_list)))
        removed_col2.append(cleaned)
        counts.append(len(cleaned))
    ds.loc[:, 'Col1'] = counts
    ds.loc[:, 'Col2'] = removed_col2
    return ds
But the performance is very poor. For example, on a dataset with 200,000 rows:
CPU times: user 11min 26s, sys: 24.2 s, total: 11min 50s
Wall time: 11min 48s
I would try:
df.Col2 = (df.Col2.map(set)-set(['a','b','h'])).map(list)
df.Col1 = df.Col2.str.len()
df
Out[111]:
Col2 Col1
0 [f, e, c, d] 4
1 [g, i] 2
2 [j, k, l] 3
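Note that going through a set does not keep the original element order (hence [f, e, c, d] in the first row above) and also drops duplicates within a list. I am also not sure that subtracting a plain set from a Series broadcasts element-wise on every pandas version. Here is a minimal sketch that does the subtraction explicitly per row, assuming the order of elements does not matter; the sample data is rebuilt from the question and sorted() is only there to make the output deterministic:

```python
import pandas as pd

# Sample data rebuilt from the question (Col3 omitted).
df = pd.DataFrame({
    "Col1": [6, 4, 5],
    "Col2": [
        ["a", "b", "c", "d", "e", "f"],
        ["a", "g", "h", "i"],
        ["a", "b", "j", "k", "l"],
    ],
})

remove = {"a", "b", "h"}

# Subtract the set row by row; sorted() turns the result back into a list
# with a deterministic order.
df["Col2"] = df["Col2"].map(lambda row: sorted(set(row) - remove))

# .str.len() also works on list-valued columns and returns each list's length.
df["Col1"] = df["Col2"].str.len()
```

If the original order or duplicates matter, a set difference is the wrong tool and the elements should be filtered one by one instead.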
Another solution, using a list comprehension:
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [6, 4, 5],
        "col2": [
            ["a", "b", "c", "d", "e", "f"],
            ["a", "g", "h", "i"],
            ["a", "b", "j", "k", "l"],
        ],
    }
)
# keep only the values that are not in the removal tuple; order is preserved
df['col2'] = [[value for value in entry
               if value not in ('a','b','h')]
              for entry in df.col2
              ]
df['col1'] = df.col2.str.len()
col1 col2
0 4 [c, d, e, f]
1 2 [g, i]
2 3 [j, k, l]
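If the removal list is long, the membership test against a tuple is a linear scan for every value. A rough sketch of the same list comprehension with the removal items held in a set, wrapped in a helper; the name remove_items and its parameters are made up for illustration and are not part of pandas:

```python
import pandas as pd

def remove_items(df, to_remove, list_col="col2", count_col="col1"):
    """Return a copy of df with to_remove filtered out of each list in list_col.

    Hypothetical helper: the name and the default column names are assumptions.
    """
    drop = set(to_remove)  # set membership checks are O(1) on average
    out = df.copy()
    filtered = [[v for v in entry if v not in drop] for entry in out[list_col]]
    # Wrapping in a Series keeps each inner list as a single cell value.
    out[list_col] = pd.Series(filtered, index=out.index)
    out[count_col] = out[list_col].str.len()
    return out

example = pd.DataFrame({
    "col1": [6, 4, 5],
    "col2": [
        ["a", "b", "c", "d", "e", "f"],
        ["a", "g", "h", "i"],
        ["a", "b", "j", "k", "l"],
    ],
})
result = remove_items(example, ["a", "b", "h"])
```

Unlike the set-difference approach, this keeps the original order and any duplicates of the elements that remain, and it leaves the input frame untouched.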