Python DataFrame group-by with formulas and a variable number of columns

I need to build a grouped DataFrame with a variable number of columns, some of which are calculated fields.

I am not sure how to describe this, so I have attached a small table. The columns always come in sets of four, and a fifth column holds a formula computed from the four columns preceding it.

The problem is that I need to group the results, and the formula must be calculated on the per-group sums of the individual columns.

How can I do this, given that I will have multiple sets of columns and multiple grouping criteria?

(table image)

Since you have not provided your data in DataFrame format, I have made some assumptions about its structure. First, create some representative data.

import pandas as pd
from itertools import product

setdata = [[12, 4, 0, 0, 12, 3, 1, 0],
           [12, 5, 0, 0, 12, 2, 1, 0],
           [12, 4, 0, 0, 12, 3, 1, 0],
           [ 6, 0, 0, 0,  6, 0, 0, 0],
           [ 7, 4, 0, 1,  7, 1, 0, 0],
           [ 7, 5, 0, 1,  7, 2, 1, 0],
           [ 7, 4, 0, 2,  7, 2, 0, 0]]

df_sets = pd.DataFrame(setdata)
level0 = ['Set{}'.format(i) for i in range(1, df_sets.shape[-1] // 4 + 1)]
level1 = ['A', 'B', 'C', 'D']
df_sets.columns = pd.MultiIndex.from_tuples(product(level0, level1))
df_sets.insert(0, 'Age', [3, 3, 3, 4, 3, 3, 6])
df_sets.insert(1, 'Gender', [1, 1, 1, 2, 2, 1, 1])
print(df_sets)

Output:

  Age Gender Set1          Set2         
                A  B  C  D    A  B  C  D
0   3      1   12  4  0  0   12  3  1  0
1   3      1   12  5  0  0   12  2  1  0
2   3      1   12  4  0  0   12  3  1  0
3   4      2    6  0  0  0    6  0  0  0
4   3      2    7  4  0  1    7  1  0  0
5   3      1    7  5  0  1    7  2  1  0
6   6      1    7  4  0  2    7  2  0  0
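As a side note, the same column layout can probably be built a little more directly with pd.MultiIndex.from_product, which takes the two label lists as they are and does not need itertools.product. A minimal sketch, reusing the assumed data from above; the names fields, n_sets and df_alt are introduced here only for illustration:

```python
import pandas as pd

setdata = [[12, 4, 0, 0, 12, 3, 1, 0],
           [12, 5, 0, 0, 12, 2, 1, 0],
           [12, 4, 0, 0, 12, 3, 1, 0],
           [ 6, 0, 0, 0,  6, 0, 0, 0],
           [ 7, 4, 0, 1,  7, 1, 0, 0],
           [ 7, 5, 0, 1,  7, 2, 1, 0],
           [ 7, 4, 0, 2,  7, 2, 0, 0]]

fields = ['A', 'B', 'C', 'D']
# Number of sets is inferred from the row width (assumes a multiple of 4).
n_sets = len(setdata[0]) // len(fields)
set_labels = ['Set{}'.format(i) for i in range(1, n_sets + 1)]

# Cartesian product of set labels and field labels, in row-major order.
columns = pd.MultiIndex.from_product([set_labels, fields])
df_alt = pd.DataFrame(setdata, columns=columns)
df_alt.insert(0, 'Age', [3, 3, 3, 4, 3, 3, 6])
df_alt.insert(1, 'Gender', [1, 1, 1, 2, 2, 1, 1])
print(df_alt)
```

This should print the same frame as above; it is only a different way of labelling the columns.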

Next, aggregate by age and gender, summing each column within a group.

df_grp = df_sets.groupby(['Age', 'Gender']).sum()
print(df_grp)

Output:

           Set1           Set2          
              A   B  C  D    A   B  C  D
Age Gender                              
3   1        43  18  0  1   43  10  4  0
    2         7   4  0  1    7   1  0  0
4   2         6   0  0  0    6   0  0  0
6   1         7   4  0  2    7   2  0  0
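Summing before applying the formula matters because a ratio of sums is generally not the same as an average of per-row ratios. A small sketch of the difference, assuming the formula is (B + C + D) / A as in the next step and using the Set1 values of the four rows with Age 3 and Gender 1; the names rows, ratio_of_sums and mean_of_ratios are made up for this example:

```python
import pandas as pd

# Set1 values of the four rows with Age == 3 and Gender == 1.
rows = pd.DataFrame({'A': [12, 12, 12, 7],
                     'B': [4, 5, 4, 5],
                     'C': [0, 0, 0, 0],
                     'D': [0, 0, 0, 1]})

# Formula applied to the column sums (what sum-then-compute gives).
totals = rows.sum()
ratio_of_sums = (totals['B'] + totals['C'] + totals['D']) / totals['A']

# Formula applied to every row first, then averaged (a different quantity).
mean_of_ratios = (rows[['B', 'C', 'D']].sum(axis=1) / rows['A']).mean()

print(ratio_of_sums, mean_of_ratios)
```

The first value is 19 / 43, which is what the question asks for; the second weights every row equally regardless of the size of A.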

Then, compute and append the fifth column (here, "E", taken to be (B + C + D) / A) for each set from the grouped sums, and reorder the columns so that each "E" column is printed next to its corresponding set.

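# Loop over the top-level labels (Set1, Set2, ...); groupby(axis=1) is deprecated in recent pandas.
for idx in df_grp.columns.get_level_values(0).unique():
    # E = (B + C + D) / A, computed on the grouped sums.
    df_grp[(idx, 'E')] = df_grp[idx][['B', 'C', 'D']].sum(axis=1) / df_grp[(idx, 'A')]
# Sort the columns so that each E sits next to its own set.
df_grp.sort_index(axis=1, inplace=True)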
print(df_grp)

Output:

           Set1                     Set2
              A   B  C  D         E    A   B  C  D         E
Age Gender                                                  
3   1        43  18  0  1  0.441860   43  10  4  0  0.325581
    2         7   4  0  1  0.714286    7   1  0  0  0.142857
4   2         6   0  0  0  0.000000    6   0  0  0  0.000000
6   1         7   4  0  2  0.857143    7   2  0  0  0.285714
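To cover the "multiple sets of columns and multiple grouping criteria" part of the question, the steps above could be wrapped in one function. This is only a sketch under the same assumptions as before: two-level columns, sets that each contain A, B, C and D, and E = (B + C + D) / A. The function name summarize_sets and its parameters are hypothetical, and turning a zero denominator into NaN is a choice made here, not something the question specifies.

```python
import numpy as np
import pandas as pd


def summarize_sets(df, keys, num_cols=('B', 'C', 'D'), den_col='A', out_col='E'):
    """Sum every set column per group, then add one ratio column per set."""
    keys = list(keys)
    # A "set" is any top-level label that owns a `den_col` column.
    set_names = [top for top, field in df.columns if field == den_col]
    # Sum first, then keep only the set columns (drops unused key columns).
    grouped = df.groupby(keys).sum()[set_names].copy()
    for set_name in set_names:
        block = grouped[set_name]
        # Assumption: a zero denominator should yield NaN rather than inf.
        denominator = block[den_col].replace(0, np.nan)
        grouped[(set_name, out_col)] = block[list(num_cols)].sum(axis=1) / denominator
    # Put each ratio column next to the set it belongs to.
    return grouped.sort_index(axis=1)


setdata = [[12, 4, 0, 0, 12, 3, 1, 0],
           [12, 5, 0, 0, 12, 2, 1, 0],
           [12, 4, 0, 0, 12, 3, 1, 0],
           [ 6, 0, 0, 0,  6, 0, 0, 0],
           [ 7, 4, 0, 1,  7, 1, 0, 0],
           [ 7, 5, 0, 1,  7, 2, 1, 0],
           [ 7, 4, 0, 2,  7, 2, 0, 0]]
columns = pd.MultiIndex.from_product([['Set1', 'Set2'], ['A', 'B', 'C', 'D']])
df = pd.DataFrame(setdata, columns=columns)
df.insert(0, 'Age', [3, 3, 3, 4, 3, 3, 6])
df.insert(1, 'Gender', [1, 1, 1, 2, 2, 1, 1])

by_age_gender = summarize_sets(df, ['Age', 'Gender'])
by_age = summarize_sets(df, ['Age'])
print(by_age_gender)
print(by_age)
```

Because the set names are read from the column index, adding a Set3 or changing the list of grouping keys should not require any change to the function itself.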
