I have a dataframe with 700+ columns. I am doing a groupby with one column, lets say df.a, and I want to aggregate every column by mean except the last 10, which I want to aggregate my max. I am aware of creating a conditional dictionary and then passing into a groupby like this:
d = {'DATE': 'last', 'AE_NAME': 'last', 'ANSWERED_CALL': 'sum'} res = df.groupby(df.a).agg(d)
However, with so many columns, I do not want to have to write this all out. Is there a quick way to do this?
You could use zip
and some not really elegant code imo but it works:
cols = df.drop("A", axis=1).columns # drop groupby column since not in agg
len_means = len(cols[:-10]) # grabbing all cols except the last ten ones
len_max = len(cols[-10:] # grabbing the last ten cols length
d_means = {i:j for i,j in zip(cols[:-10], ["mean"]*len_means)}
d_max = {i:j for i,j in zip(cols[-10:], ["max"]*len_max)}
d = d_means.update(d_max}
res = df.groupby(df.a).agg(d)
Edit : since OP mentioned the columns are named differently (ending with letter c
then)
c_cols = [col for col in df.columns if col.endswith('c')]
non_c_cols = [col for col in df.columns if col not in c_cols]
and one only needs to plug the cols in the code above the get the result
I would approach this problem the following:
mean
and max
aggregation with GroupBy
Join
both dataframes together: # example dataframe
df = pd.DataFrame(np.random.rand(5,10), columns=list('abcdefghij'))
df.insert(0, 'ID', ['aaa', 'bbb', 'aaa', 'ccc', 'bbb'])
ID a b c d e f g h i j
0 aaa 0.228208 0.822641 0.407747 0.416335 0.039717 0.854789 0.108124 0.666190 0.074569 0.329419
1 bbb 0.285293 0.274654 0.507607 0.527335 0.599833 0.511760 0.747992 0.930221 0.396697 0.959254
2 aaa 0.844373 0.431420 0.083631 0.656162 0.511913 0.486187 0.955340 0.130358 0.759013 0.181874
3 ccc 0.259888 0.992480 0.365106 0.041288 0.833069 0.474904 0.212645 0.178981 0.595891 0.143127
4 bbb 0.823457 0.172947 0.907415 0.719616 0.632012 0.199703 0.672745 0.563852 0.120827 0.092455
cutoff = 7
mean_cols = df.columns[:cutoff]
max_cols = ['ID'] + df.columns[cutoff:].tolist()
df1 = df[mean_cols].groupby('ID').mean()
df2 = df[max_cols].groupby('ID').max()
df = df1.join(df2).reset_index()
ID a b c d e f g h i j
0 aaa 0.536290 0.627031 0.245689 0.536248 0.275815 0.670488 0.955340 0.666190 0.759013 0.329419
1 bbb 0.554375 0.223800 0.707511 0.623476 0.615923 0.355732 0.747992 0.930221 0.396697 0.959254
2 ccc 0.259888 0.992480 0.365106 0.041288 0.833069 0.474904 0.212645 0.178981 0.595891 0.143127
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.