
Python DataFrame group by with formulas and a variable number of columns

I need to create a grouped data frame with a variable number of columns, some of which are calculated fields.

I am not sure how to describe this, so I have attached a small table. The columns always come in sets of four, and a fifth column contains a formula calculated from the four preceding columns.

The problem is that I need to group the results, and the formula should be calculated on the sums of the individual columns within each group.

The question is how to do this, given that I will have multiple sets of columns and multiple grouping criteria.

[Image: table]

Since you have not provided your data in DataFrame format, I have made some assumptions about its structure. First, create some representative data.

import pandas as pd
from itertools import product

setdata = [[12, 4, 0, 0, 12, 3, 1, 0],
           [12, 5, 0, 0, 12, 2, 1, 0],
           [12, 4, 0, 0, 12, 3, 1, 0],
           [ 6, 0, 0, 0,  6, 0, 0, 0],
           [ 7, 4, 0, 1,  7, 1, 0, 0],
           [ 7, 5, 0, 1,  7, 2, 1, 0],
           [ 7, 4, 0, 2,  7, 2, 0, 0]]

df_sets = pd.DataFrame(setdata)
level0 = ['Set{}'.format(i) for i in range(1, df_sets.shape[-1] // 4 + 1)]
level1 = ['A', 'B', 'C', 'D']
df_sets.columns = pd.MultiIndex.from_tuples(product(level0, level1))
df_sets.insert(0, 'Age', [3, 3, 3, 4, 3, 3, 6])
df_sets.insert(1, 'Gender', [1, 1, 1, 2, 2, 1, 1])
print(df_sets)

Output:

  Age Gender Set1          Set2         
                A  B  C  D    A  B  C  D
0   3      1   12  4  0  0   12  3  1  0
1   3      1   12  5  0  0   12  2  1  0
2   3      1   12  4  0  0   12  3  1  0
3   4      2    6  0  0  0    6  0  0  0
4   3      2    7  4  0  1    7  1  0  0
5   3      1    7  5  0  1    7  2  1  0
6   6      1    7  4  0  2    7  2  0  0

Then, aggregate by age and gender, summing each column within each group.

df_grp = df_sets.groupby(['Age', 'Gender']).sum()
print(df_grp)

Output:

           Set1           Set2          
              A   B  C  D    A   B  C  D
Age Gender                              
3   1        43  18  0  1   43  10  4  0
    2         7   4  0  1    7   1  0  0
4   2         6   0  0  0    6   0  0  0
6   1         7   4  0  2    7   2  0  0
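The sums are taken first because the formula has to be applied to group totals rather than averaged over rows. As a minimal sketch of the difference, the following compares both orders of operation on the four Set1 rows with Age 3 and Gender 1. It assumes the formula is E = (B + C + D) / A; the variable names are illustrative only.

```python
import pandas as pd

# Set1 rows with Age == 3 and Gender == 1, copied from the sample data
rows = pd.DataFrame({'A': [12, 12, 12, 7],
                     'B': [4, 5, 4, 5],
                     'C': [0, 0, 0, 0],
                     'D': [0, 0, 0, 1]})

# Assumed formula: E = (B + C + D) / A

# Formula applied to the column sums (what the question asks for): 19 / 43
ratio_of_sums = rows[['B', 'C', 'D']].sum().sum() / rows['A'].sum()

# Formula applied row by row, then averaged (gives a different number)
mean_of_ratios = (rows[['B', 'C', 'D']].sum(axis=1) / rows['A']).mean()

print(ratio_of_sums)
print(mean_of_ratios)
```

The two values differ, so the per-row formula column cannot simply be carried through the aggregation; it has to be recomputed from the summed columns.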

Then, compute and append the fifth column (here, "E") for each set, and reorder the columns so that each "E" column is printed next to its corresponding set.

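# Iterate over the set names (level 0 of the columns) rather than using
# groupby(level=0, axis=1), which is deprecated in recent pandas releases.
for idx in df_grp.columns.get_level_values(0).unique():
    # E = (B + C + D) / A, computed on the grouped sums
    df_grp[(idx, 'E')] = df_grp[idx][['B', 'C', 'D']].sum(axis=1) / df_grp[(idx, 'A')]
df_grp.sort_index(axis=1, inplace=True)
print(df_grp)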

Output:

           Set1                     Set2
              A   B  C  D         E    A   B  C  D         E
Age Gender                                                  
3   1        43  18  0  1  0.441860   43  10  4  0  0.325581
    2         7   4  0  1  0.714286    7   1  0  0  0.142857
4   2         6   0  0  0  0.000000    6   0  0  0  0.000000
6   1         7   4  0  2  0.857143    7   2  0  0  0.285714
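For multiple sets of columns and multiple grouping criteria, the steps above could be wrapped in a small helper. This is only a sketch under stated assumptions: the name `grouped_formula` and its parameters are hypothetical, the grouping keys are assumed to live in the row index (not in the columns, as in the data above), every set is assumed to have columns 'A' to 'D', and the formula is assumed to be (B + C + D) / A.

```python
import pandas as pd


def grouped_formula(df, keys, new_col='E'):
    """Sum each set per group, then add the formula column to every set.

    Hypothetical helper: `keys` are index level names, and the columns are
    a two-level MultiIndex of (set name, 'A'..'D').
    """
    sums = df.groupby(level=keys).sum()
    extra = {}
    for set_name in sums.columns.get_level_values(0).unique():
        block = sums[set_name]
        # Assumed formula, applied to the grouped sums
        extra[(set_name, new_col)] = block[['B', 'C', 'D']].sum(axis=1) / block['A']
    out = pd.concat([sums, pd.DataFrame(extra)], axis=1)
    # Sort so each new column sits next to its own set
    return out.sort_index(axis=1)


data = [[12, 4, 0, 0, 12, 3, 1, 0],
        [12, 5, 0, 0, 12, 2, 1, 0],
        [12, 4, 0, 0, 12, 3, 1, 0],
        [ 6, 0, 0, 0,  6, 0, 0, 0],
        [ 7, 4, 0, 1,  7, 1, 0, 0],
        [ 7, 5, 0, 1,  7, 2, 1, 0],
        [ 7, 4, 0, 2,  7, 2, 0, 0]]
columns = pd.MultiIndex.from_product([['Set1', 'Set2'], ['A', 'B', 'C', 'D']])
index = pd.MultiIndex.from_arrays([[3, 3, 3, 4, 3, 3, 6],
                                   [1, 1, 1, 2, 2, 1, 1]],
                                  names=['Age', 'Gender'])
df = pd.DataFrame(data, index=index, columns=columns)

by_both = grouped_formula(df, ['Age', 'Gender'])  # same grouping as above
by_age = grouped_formula(df, ['Age'])             # a different grouping criterion
print(by_both)
print(by_age)
```

Adding more sets only means adding more column blocks to the input, and a different grouping only means passing different `keys`. A group whose summed 'A' is zero would produce `inf` or `NaN` in the new column, which this sketch does not handle.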


 