Optimizing a groupby agg function to return multiple result columns
I have this DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Client': np.random.choice(['Customer_A', 'Customer_B'], 1000),
    'Product': np.random.choice(['Guns', 'Ammo', 'Armour'], 1000),
    'Value': np.random.randn(1000)
})
Categoricals = ['Client', 'Product']
df[Categoricals] = df[Categoricals].astype('category')
df = df.drop_duplicates()
df
I want this result:
# Non-anonymous function for Anomaly limit
def Anomaly(x):
    Q3 = np.nanpercentile(x, q=75)
    Q1 = np.nanpercentile(x, q=25)
    IQR = Q3 - Q1
    return Q3 + (IQR * 2.0)

# Non-anonymous function for CriticalAnomaly limit
def CriticalAnomaly(x):
    Q3 = np.nanpercentile(x, q=75)
    Q1 = np.nanpercentile(x, q=25)
    IQR = Q3 - Q1
    return Q3 + (IQR * 3.0)

# Define metrics
Metrics = {'Value': ['count', Anomaly, CriticalAnomaly]}
# Groupby has more than 1 grouping column, so agg can only accept non-anonymous functions
Limits = df.groupby(['Client', 'Product']).agg(Metrics)
Limits
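But this is slow on large datasets, because the functions "Anomaly" and "CriticalAnomaly" each recompute Q1, Q3 and IQR, so those values are calculated twice instead of once. Combining the two functions makes it faster, but then the output ends up in 1 column instead of 2.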
# Combined anomaly functions
def CombinedAnom(x):
    Q3 = np.nanpercentile(x, q=75)
    Q1 = np.nanpercentile(x, q=25)
    IQR = Q3 - Q1
    Anomaly = Q3 + (IQR * 2.0)
    CriticalAnomaly = Q3 + (IQR * 3.0)
    return (Anomaly, CriticalAnomaly)

# Define metrics
Metrics = {'Value': ['count', CombinedAnom]}
# Groupby has more than 1 grouping column, so agg can only accept non-anonymous functions
Limits = df.groupby(['Client', 'Product']).agg(Metrics)
Limits
How can I make the combined function return its results in two columns?
If you use apply instead of agg, you can return a Series, which gets unpacked into columns:
def f(g):
    # Each key of the returned Series becomes a column in the result
    return pd.Series({
        'c1': np.sum(g.b),
        'c2': np.prod(g.b)
    })

df = pd.DataFrame({'a': list('aabbcc'), 'b': [1, 2, 3, 4, 5, 6]})
df.groupby('a').apply(f)
This goes from:
   a  b
0  a  1
1  a  2
2  b  3
3  b  4
4  c  5
5  c  6
to:
   c1  c2
a
a   3   2
b   7  12
c  11  30
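Applied to the data from the question, the same idea might look like the sketch below. The name combined_limits is made up for this example and is not from the original post. Selecting the Value column and calling unstack() is also an addition here: when apply runs on a single column and the function returns a Series, the keys of that Series come back as an extra index level rather than as columns, and unstack() moves them into columns. Treat this as a sketch of the approach, not a benchmarked solution.

```python
import numpy as np
import pandas as pd

def combined_limits(values):
    # Hypothetical helper: compute both quartiles in one call so that
    # Q1, Q3 and IQR are evaluated once per group.
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    return pd.Series({
        'count': values.count(),
        'Anomaly': q3 + iqr * 2.0,
        'CriticalAnomaly': q3 + iqr * 3.0,
    })

# Same shape of data as in the question, with a fixed seed for repeatability
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Client': rng.choice(['Customer_A', 'Customer_B'], 1000),
    'Product': rng.choice(['Guns', 'Ammo', 'Armour'], 1000),
    'Value': rng.standard_normal(1000),
})

# apply on one column returns a Series with an extra index level holding
# the keys of the returned Series; unstack() turns that level into columns.
limits = (
    df.groupby(['Client', 'Product'])['Value']
      .apply(combined_limits)
      .unstack()
)
print(limits)
```

Because the values are combined into a single Series per group, the count column will likely come back as float rather than int. Cast it afterwards if that matters.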