Optimizing a groupby agg function to return multiple result columns
I have this dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Client': np.random.choice(['Customer_A', 'Customer_B'], 1000),
    'Product': np.random.choice(['Guns', 'Ammo', 'Armour'], 1000),
    'Value': np.random.randn(1000)
})
Categoricals = ['Client', 'Product']
df[Categoricals] = df[Categoricals].astype('category')
df = df.drop_duplicates()
df
And I want this result:
# Non-anonymous function for Anomaly limit
def Anomaly(x):
    Q3 = np.nanpercentile(x, q=75)
    Q1 = np.nanpercentile(x, q=25)
    IQR = Q3 - Q1
    return Q3 + (IQR * 2.0)

# Non-anonymous function for CriticalAnomaly limit
def CriticalAnomaly(x):
    Q3 = np.nanpercentile(x, q=75)
    Q1 = np.nanpercentile(x, q=25)
    IQR = Q3 - Q1
    return Q3 + (IQR * 3.0)
# Define metrics
Metrics = {'Value':['count', Anomaly, CriticalAnomaly]}
# Groupby has more than 1 grouping column, so agg can only accept non-anonymous functions
Limits = df.groupby(['Client', 'Product']).agg(Metrics)
Limits
But it's slow on large datasets because "Anomaly" and "CriticalAnomaly" each compute Q1, Q3 and IQR, so every group's quartiles are calculated twice instead of once. Combining both functions into one makes it much faster, but the results are output into 1 column (holding a tuple) instead of 2.
# Combined anomaly functions
def CombinedAnom(x):
    Q3 = np.nanpercentile(x, q=75)
    Q1 = np.nanpercentile(x, q=25)
    IQR = Q3 - Q1
    Anomaly = Q3 + (IQR * 2.0)
    CriticalAnomaly = Q3 + (IQR * 3.0)
    return (Anomaly, CriticalAnomaly)
# Define metrics
Metrics = {'Value':['count', CombinedAnom]}
# Groupby has more than 1 grouping column, so agg can only accept non-anonymous functions
Limits = df.groupby(['Client', 'Product']).agg(Metrics)
Limits
How can I write a combined function so that the results go into 2 columns?
If you use apply instead of agg, you can return a Series that gets unpacked into columns:
def f(g):
    return pd.Series({
        'c1': np.sum(g.b),
        'c2': np.prod(g.b)
    })
df = pd.DataFrame({'a': list('aabbcc'), 'b': [1,2,3,4,5,6]})
df.groupby('a').apply(f)
This goes from:
   a  b
0  a  1
1  a  2
2  b  3
3  b  4
4  c  5
5  c  6
to
   c1  c2
a
a   3   2
b   7  12
c  11  30
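Applied to the data in the question, the same idea might look like the sketch below. It computes the quartiles once per group and returns the count and both limits as a Series, so each key becomes its own column. A few things here are assumptions rather than part of the original posts: the name combined_anom is illustrative, the data uses a seeded generator and plain string columns instead of categoricals to keep the example reproducible, and the Value column is selected explicitly so only that column is passed to the function.

```python
import numpy as np
import pandas as pd

# Reproducible stand-in for the question's data (seed and string dtype are assumptions)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Client': rng.choice(['Customer_A', 'Customer_B'], 1000),
    'Product': rng.choice(['Guns', 'Ammo', 'Armour'], 1000),
    'Value': rng.standard_normal(1000),
})

def combined_anom(g):
    # Both quartiles come from a single nanpercentile call per group
    q1, q3 = np.nanpercentile(g['Value'], [25, 75])
    iqr = q3 - q1
    # Each key of the returned Series becomes a column in the result
    return pd.Series({
        'count': g['Value'].count(),
        'Anomaly': q3 + iqr * 2.0,
        'CriticalAnomaly': q3 + iqr * 3.0,
    })

# Select only 'Value' so the function receives a one-column DataFrame per group
limits = df.groupby(['Client', 'Product'])[['Value']].apply(combined_anom)

# The Series holds a single dtype, so the counts come back as floats
limits['count'] = limits['count'].astype(int)
print(limits)
```

The result is indexed by (Client, Product) with the columns count, Anomaly and CriticalAnomaly. Whether this is actually faster than two separate agg functions depends on the number and size of groups, so it is worth timing on your own data.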