Vectorizing the aggregation operation on different columns of a Pandas dataframe

I have a Pandas dataframe, mostly containing boolean columns. A small example is:

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 1, 2, 3],
                   "B": ['a', 'b', 'c', 'a', 'b', 'c'],
                   "f1": [True, True, True, True, True, False],
                   "f2": [True, True, True, True, False, True],
                   "f3": [True, True, True, False, True, True],
                   "f4": [True, True, False, True, True, True],
                   "f5": [True, False, True, True, True, True],
                   "target1": [True, False, True, True, False, True],
                   "target2": [False, True, True, False, True, False]})

df

Output:

    A   B   f1      f2      f3      f4      f5    target1  target2
0   1   a   True    True    True    True    True    True    False
1   2   b   True    True    True    True    False   False   True
2   3   c   True    True    True    False   True    True    True
3   1   a   True    True    False   True    True    True    False
4   2   b   True    False   True    True    True    False   True
5   3   c   False   True    True    True    True    True    False

For each True and False class of each f column, and for every group in the ("A", "B") columns, I want to sum the target1 and target2 columns. Using a loop over the f columns, we have:

for col in ["f1", "f2", "f3", "f4", "f5"]:
    print(col, "\n", 
          df[df[col]].groupby(["A", "B"]).agg({"target1": "sum", "target2": "sum"}), "\n",
          df[~df[col]].groupby(["A", "B"]).agg({"target1": "sum", "target2": "sum"}))

Now, I need to do it without the for loop; I mean a vectorization over the f columns to reduce the computation time (the total time should be almost equal to the time needed for a single f column).

Use DataFrame.melt to reshape the f columns into long format, so that it is possible to aggregate by the f column name (variable) and by its True/False value (value):

df = df.melt(['A','B','target1','target2'])

df1 = df.groupby(["A", "B","variable","value"]).agg({"target1": "sum", "target2": "sum"})
print (df1)
                    target1  target2
A B variable value                  
1 a f1       True         2        0
    f2       True         2        0
    f3       False        1        0
             True         1        0
    f4       True         2        0
    f5       True         2        0
2 b f1       True         0        2
    f2       False        0        1
             True         0        1
    f3       True         0        2
    f4       True         0        2
    f5       False        0        1
             True         0        1
3 c f1       False        1        0
             True         1        1
    f2       True         2        1
    f3       True         2        1
    f4       False        1        1
             True         1        0
    f5       True         2        1
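
Note that the grouped result only contains the True/False classes that actually occur within a group (for example, there is no f1/False row for group (1, a)). If an explicit zero is wanted for the missing class, one option is to unstack the value level. This is a minimal sketch, not part of the original answer; the names long and wide are my own, and the melted frame is kept separate here instead of overwriting df:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 1, 2, 3],
                   "B": ['a', 'b', 'c', 'a', 'b', 'c'],
                   "f1": [True, True, True, True, True, False],
                   "f2": [True, True, True, True, False, True],
                   "f3": [True, True, True, False, True, True],
                   "f4": [True, True, False, True, True, True],
                   "f5": [True, False, True, True, True, True],
                   "target1": [True, False, True, True, False, True],
                   "target2": [False, True, True, False, True, False]})

# Long format: one row per (original row, f column) pair.
long = df.melt(id_vars=["A", "B", "target1", "target2"])

# Same aggregation as in the answer, written with column selection + sum.
df1 = long.groupby(["A", "B", "variable", "value"])[["target1", "target2"]].sum()

# Move the True/False level into the columns; classes that never occur
# in a group are filled with 0 instead of being absent.
wide = df1.unstack("value", fill_value=0)
print(wide)
```

Each row of wide is then one (A, B, f column) combination, with the False and True sums side by side under target1 and target2.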

Then the result for a single f column and class can be selected with:

print (df1.query("variable=='f1' and value==True").droplevel([-1,-2]))
     target1  target2
A B                  
1 a        2        0
2 b        0        2
3 c        1        1

Or:

idx = pd.IndexSlice
print (df1.loc[idx[:, :, 'f1', True],:].droplevel([-1,-2]))
     target1  target2
A B                  
1 a        2        0
2 b        0        2
3 c        1        1
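
Another way to pull out one column and class is DataFrame.xs, which drops the selected levels automatically. The sketch below is an illustration rather than part of the original answer (the names long, f_cols, selected, expected and got are my own). It also checks that the melted result matches the loop from the question for every f column and both classes:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 1, 2, 3],
                   "B": ['a', 'b', 'c', 'a', 'b', 'c'],
                   "f1": [True, True, True, True, True, False],
                   "f2": [True, True, True, True, False, True],
                   "f3": [True, True, True, False, True, True],
                   "f4": [True, True, False, True, True, True],
                   "f5": [True, False, True, True, True, True],
                   "target1": [True, False, True, True, False, True],
                   "target2": [False, True, True, False, True, False]})

f_cols = ["f1", "f2", "f3", "f4", "f5"]
targets = ["target1", "target2"]

# Single melt + groupby, as in the answer.
long = df.melt(id_vars=["A", "B"] + targets, value_vars=f_cols)
df1 = long.groupby(["A", "B", "variable", "value"])[targets].sum()

# Cross-section for f1 == True; the "variable" and "value" levels are dropped.
selected = df1.xs(("f1", True), level=["variable", "value"])
print(selected)

# Sanity check against the loop-based version from the question.
# In this sample data every f column has both classes; xs would raise
# a KeyError for a (column, class) pair that does not occur at all.
for col in f_cols:
    for flag in (True, False):
        mask = df[col] if flag else ~df[col]
        expected = df[mask].groupby(["A", "B"])[targets].sum()
        got = df1.xs((col, flag), level=["variable", "value"])
        pd.testing.assert_frame_equal(got, expected)
```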

