Vectorizing the aggregation operation on different columns of a Pandas dataframe
I have a Pandas dataframe, mostly containing boolean columns. A small example is:
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3, 1, 2, 3],
                   "B": ['a', 'b', 'c', 'a', 'b', 'c'],
                   "f1": [True, True, True, True, True, False],
                   "f2": [True, True, True, True, False, True],
                   "f3": [True, True, True, False, True, True],
                   "f4": [True, True, False, True, True, True],
                   "f5": [True, False, True, True, True, True],
                   "target1": [True, False, True, True, False, True],
                   "target2": [False, True, True, False, True, False]})
df
Output:
A B f1 f2 f3 f4 f5 target1 target2
0 1 a True True True True True True False
1 2 b True True True True False False True
2 3 c True True True False True True True
3 1 a True True False True True True False
4 2 b True False True True True False True
5 3 c False True True True True True False
For each True and False class of each f column, and for all groups in the ("A", "B") columns, I want to sum over the target1 and target2 columns. Using a loop over the f columns, we have:
for col in ["f1", "f2", "f3", "f4", "f5"]:
    print(col, "\n",
          df[df[col]].groupby(["A", "B"]).agg({"target1": "sum", "target2": "sum"}), "\n",
          df[~df[col]].groupby(["A", "B"]).agg({"target1": "sum", "target2": "sum"}))
Now, I need to do it without the for loop; I mean a vectorization over the f columns to reduce the computation time (the computation time should be almost equal to the time needed for a single f column).
Use DataFrame.melt, so that you can aggregate by the f column names (the variable column) and by value (True/False):
df = df.melt(['A','B','target1','target2'])
df1 = df.groupby(["A", "B","variable","value"]).agg({"target1": "sum", "target2": "sum"})
print (df1)
target1 target2
A B variable value
1 a f1 True 2 0
f2 True 2 0
f3 False 1 0
True 1 0
f4 True 2 0
f5 True 2 0
2 b f1 True 0 2
f2 False 0 1
True 0 1
f3 True 0 2
f4 True 0 2
f5 False 0 1
True 0 1
3 c f1 False 1 0
True 1 1
f2 True 2 1
f3 True 2 1
f4 False 1 1
True 1 0
f5 True 2 1
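As a sanity check, the melted aggregation can be compared with the original per-column loop. The following is only a sketch: the names f_cols, targets, agg, flat and mismatches are introduced here for illustration, and the melted frame is kept in its own variable instead of overwriting df.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 1, 2, 3],
                   "B": ['a', 'b', 'c', 'a', 'b', 'c'],
                   "f1": [True, True, True, True, True, False],
                   "f2": [True, True, True, True, False, True],
                   "f3": [True, True, True, False, True, True],
                   "f4": [True, True, False, True, True, True],
                   "f5": [True, False, True, True, True, True],
                   "target1": [True, False, True, True, False, True],
                   "target2": [False, True, True, False, True, False]})

f_cols = ["f1", "f2", "f3", "f4", "f5"]   # illustrative names, not from the post
targets = ["target1", "target2"]

# Loop-free aggregation: one row per (original row, f column) after melt.
agg = (df.melt(id_vars=["A", "B"] + targets, value_vars=f_cols)
         .groupby(["A", "B", "variable", "value"])[targets]
         .sum())

# Compare every (f column, True/False) slice with the loop version.
flat = agg.reset_index()
mismatches = []
for col in f_cols:
    for flag in (True, False):
        expected = df[df[col] == flag].groupby(["A", "B"])[targets].sum()
        rows = flat[(flat["variable"] == col) & (flat["value"] == flag)]
        got = rows.set_index(["A", "B"])[targets]
        if got.to_dict() != expected.to_dict():
            mismatches.append((col, flag))

print("mismatches:", mismatches)
```

Note that melt produces len(df) times the number of f columns rows, so the single groupby works on a longer frame than any one iteration of the loop does.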
Then selecting is possible by:
print (df1.query("variable=='f1' and value==True").droplevel([-1,-2]))
target1 target2
A B
1 a 2 0
2 b 0 2
3 c 1 1
Or:
idx = pd.IndexSlice
print (df1.loc[idx[:, :, 'f1', True],:].droplevel([-1,-2]))
target1 target2
A B
1 a 2 0
2 b 0 2
3 c 1 1
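A third way to select, offered as a sketch and not part of the original answer, is DataFrame.xs, which picks a cross-section by level name and drops the selected levels by default, so no droplevel call is needed. The names long_df and f1_true are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 1, 2, 3],
                   "B": ['a', 'b', 'c', 'a', 'b', 'c'],
                   "f1": [True, True, True, True, True, False],
                   "f2": [True, True, True, True, False, True],
                   "f3": [True, True, True, False, True, True],
                   "f4": [True, True, False, True, True, True],
                   "f5": [True, False, True, True, True, True],
                   "target1": [True, False, True, True, False, True],
                   "target2": [False, True, True, False, True, False]})

# Same aggregation as above, keeping the original df untouched.
long_df = df.melt(['A', 'B', 'target1', 'target2'])
df1 = long_df.groupby(["A", "B", "variable", "value"]).agg(
    {"target1": "sum", "target2": "sum"})

# Cross-section for variable == 'f1' and value == True;
# the 'variable' and 'value' levels are dropped from the result.
f1_true = df1.xs(("f1", True), level=["variable", "value"])
print(f1_true)
```

This selects by level name instead of by position, which may be easier to read when the index has many levels.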