I've got a dataframe in the format:
template is_a is_b is_c is_d is_e
0 cv_template 0 1 0 0 0
1 topic_template 1 0 0 0 0
2 model_template 1 0 0 0 0
3 model_template 0 1 0 0 0
I would like to group by template and aggregate the is_ columns, which are binary values, for each template.
That is, for the example above, the output would be:
template is_a is_b is_c is_d is_e
0 cv_template 0 1 0 0 0
1 topic_template 1 0 0 0 0
2 model_template 1 1 0 0 0
My current solution is to do something like this:
df.groupby('template', as_index=False)[['is_a', 'is_b', 'is_c', 'is_d', 'is_e']].max()
However, when working on large datasets, the group by is slow. I was wondering if there was a better way of doing this which would speed things up.
I can't be certain this will be much quicker, but I put this together with Numba:
import pandas as pd
import numpy as np
from numba import njit
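@njit
def max_at(i, a, shape):
    # i: group code for each row, a: 2-D array of flags, shape: (n_groups, n_cols)
    out = np.zeros(shape, np.int64)
    for j in range(len(a)):
        row = a[j]
        pos = i[j]
        cur = out[pos]
        # keep the element-wise maximum seen so far for this group
        out[pos, :] = np.maximum(cur, row)
    return out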
i, t = df['template'].factorize()
cols = ['is_a', 'is_b', 'is_c', 'is_d', 'is_e']
is_ = np.column_stack([df[c].to_numpy() for c in cols])
result = max_at(i, is_, (len(t), len(cols)))
pd.DataFrame(result, t, cols).reset_index()
index is_a is_b is_c is_d is_e
0 cv_template 0 1 0 0 0
1 topic_template 1 0 0 0 0
2 model_template 1 1 0 0 0
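If Numba is not available, the same factorize-then-scatter idea can be sketched with plain NumPy using np.maximum.at. This is a swapped-in variant, not the code above. The sample frame is rebuilt from the question, the names codes, labels and numpy_result are mine, and I have not benchmarked it: ufunc.at can be slow on some NumPy versions, so it is worth timing against the groupby on your own data.

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the question.
df = pd.DataFrame({
    'template': ['cv_template', 'topic_template', 'model_template', 'model_template'],
    'is_a': [0, 1, 1, 0],
    'is_b': [1, 0, 0, 1],
    'is_c': [0, 0, 0, 0],
    'is_d': [0, 0, 0, 0],
    'is_e': [0, 0, 0, 0],
})

cols = ['is_a', 'is_b', 'is_c', 'is_d', 'is_e']

# Integer group codes (in order of first appearance) and the unique labels.
codes, labels = df['template'].factorize()

# Zeros are a valid starting value only because the flags are 0/1.
out = np.zeros((len(labels), len(cols)), dtype=np.int64)

# Unbuffered in-place maximum: rows sharing a code are folded into one output row.
np.maximum.at(out, codes, df[cols].to_numpy())

numpy_result = pd.DataFrame(out, index=labels, columns=cols).reset_index()

# Cross-check against the plain groupby (sort=False keeps first-appearance order).
expected = df.groupby('template', sort=False)[cols].max()
assert (numpy_result[cols].to_numpy() == expected.to_numpy()).all()
```

As with the Numba version, the template labels end up in a column named index after reset_index().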