
Pandas groupby different aggregation values with big dataframe

I have a dataframe with 700+ columns. I am doing a groupby on one column, let's say df.a, and I want to aggregate every column by mean except the last 10, which I want to aggregate by max. I know I can build a dictionary that maps each column to its aggregation and pass it to the groupby like this:

 d = {'DATE': 'last', 'AE_NAME': 'last', 'ANSWERED_CALL': 'sum'}
 res = df.groupby(df.a).agg(d)

However, with so many columns, I do not want to have to write this all out. Is there a quick way to do this?

You could use zip. The code is not really elegant imo, but it works:

cols = df.drop("a", axis=1).columns # drop groupby column since not in agg

len_means = len(cols[:-10]) # number of columns other than the last ten

len_max = len(cols[-10:]) # number of columns in the last ten

d_means = {i:j for i,j in zip(cols[:-10], ["mean"]*len_means)}

d_max = {i:j for i,j in zip(cols[-10:], ["max"]*len_max)}

d = {**d_means, **d_max} # dict.update() returns None, so merge into a new dict instead

res = df.groupby(df.a).agg(d)
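
A shorter way to build the same kind of dictionary is dict.fromkeys, which gives every key the same value. This is only a sketch: the dataframe and its column names (col0 to col14, with "a" as the groupby column) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a grouping column "a" plus 15 value columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((6, 15)), columns=[f"col{i}" for i in range(15)])
df.insert(0, "a", ["x", "y", "x", "z", "y", "x"])

cols = df.columns.drop("a")  # every column except the groupby key

# dict.fromkeys maps each column name to the same aggregation name
d = {**dict.fromkeys(cols[:-10], "mean"), **dict.fromkeys(cols[-10:], "max")}

res = df.groupby("a").agg(d)
```

The result keeps the columns in the order they appear in the dictionary, so the mean columns come first and the last ten follow.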

Edit: since the OP mentioned that the columns are distinguished by name instead (the relevant ones end with the letter c), you can select them like this:

c_cols = [col for col in df.columns if col.endswith('c')]
non_c_cols = [col for col in df.columns if col not in c_cols]

One then only needs to plug these column lists into the code above to get the result.
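
Plugged together, that could look roughly like the sketch below. The column names and the "key" groupby column are hypothetical, and it assumes the columns ending in c are the ones that should get max. Note that the groupby column is left out of both lists before building the dictionary.

```python
import pandas as pd

# Hypothetical data: "key" is the groupby column; names ending in "c" get max.
df = pd.DataFrame({
    "key": ["x", "y", "x", "y"],
    "price": [1.0, 2.0, 3.0, 4.0],
    "qty": [10, 20, 30, 40],
    "peak_c": [5, 6, 7, 8],
    "load_c": [0.1, 0.4, 0.3, 0.2],
})

value_cols = df.columns.drop("key")  # leave the groupby column out of the dict
c_cols = [col for col in value_cols if col.endswith("c")]
non_c_cols = [col for col in value_cols if col not in c_cols]

d = {col: "mean" for col in non_c_cols}
d.update({col: "max" for col in c_cols})  # update() changes d in place and returns None

res = df.groupby("key").agg(d)
```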

I would approach this problem as follows:

  1. Define a cutoff for which columns to select
  2. Select the columns you need
  3. Create both your mean and max aggregation with GroupBy
  4. Join both dataframes together:

# example dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5,10), columns=list('abcdefghij'))
df.insert(0, 'ID', ['aaa', 'bbb', 'aaa', 'ccc', 'bbb'])

    ID         a         b         c         d         e         f         g         h         i         j
0  aaa  0.228208  0.822641  0.407747  0.416335  0.039717  0.854789  0.108124  0.666190  0.074569  0.329419
1  bbb  0.285293  0.274654  0.507607  0.527335  0.599833  0.511760  0.747992  0.930221  0.396697  0.959254
2  aaa  0.844373  0.431420  0.083631  0.656162  0.511913  0.486187  0.955340  0.130358  0.759013  0.181874
3  ccc  0.259888  0.992480  0.365106  0.041288  0.833069  0.474904  0.212645  0.178981  0.595891  0.143127
4  bbb  0.823457  0.172947  0.907415  0.719616  0.632012  0.199703  0.672745  0.563852  0.120827  0.092455
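
cutoff = 7  # the first 7 columns (ID plus a..f) are aggregated by mean
mean_cols = df.columns[:cutoff]
max_cols = ['ID'] + df.columns[cutoff:].tolist()  # keep ID so this frame can be grouped too

df1 = df[mean_cols].groupby('ID').mean()
df2 = df[max_cols].groupby('ID').max()

df = df1.join(df2).reset_index()  # both results are indexed by ID, so join aligns on it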

    ID         a         b         c         d         e         f         g         h         i         j
0  aaa  0.536290  0.627031  0.245689  0.536248  0.275815  0.670488  0.955340  0.666190  0.759013  0.329419
1  bbb  0.554375  0.223800  0.707511  0.623476  0.615923  0.355732  0.747992  0.930221  0.396697  0.959254
2  ccc  0.259888  0.992480  0.365106  0.041288  0.833069  0.474904  0.212645  0.178981  0.595891  0.143127
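
As a cross-check, the two-step result should match a single agg call with a column-to-function dictionary. The sketch below uses the same made-up layout as the example (ID plus columns a to j) but a seeded random generator, so the numbers differ from the output shown above. The names two_step and one_step are just for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 10)), columns=list("abcdefghij"))
df.insert(0, "ID", ["aaa", "bbb", "aaa", "ccc", "bbb"])

cutoff = 7
mean_cols = df.columns[1:cutoff]  # a..f (skipping ID)
max_cols = df.columns[cutoff:]    # g..j

# Two-step version: aggregate separately, then join on the ID index
two_step = df.groupby("ID")[list(mean_cols)].mean().join(
    df.groupby("ID")[list(max_cols)].max()
)

# One-step version: a single agg call with a column -> function mapping
one_step = df.groupby("ID").agg(
    {**dict.fromkeys(mean_cols, "mean"), **dict.fromkeys(max_cols, "max")}
)

# Raises an AssertionError if the two results differ
pd.testing.assert_frame_equal(two_step, one_step)
```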
