I want to pass the numpy percentile()
function through pandas' agg()
function as I do below with various other numpy statistics functions.
Right now I have a dataframe that looks like this:
AGGREGATE MY_COLUMN
A 10
A 12
B 5
B 9
A 84
B 22
And my code looks like this:
grouped = dataframe.groupby('AGGREGATE')
column = grouped['MY_COLUMN']
column.agg([np.sum, np.mean, np.std, np.median, np.var, np.min, np.max])
The above code works, but I want to do something like
column.agg([np.sum, np.mean, np.percentile(50), np.percentile(95)])
Ie, specify various percentiles to return from agg()
.
How should this be done?
Perhaps not super efficient, but one way would be to create a function yourself:
def percentile(n):
def percentile_(x):
return np.percentile(x, n)
percentile_.__name__ = 'percentile_%s' % n
return percentile_
Then include this in your agg
:
In [11]: column.agg([np.sum, np.mean, np.std, np.median,
np.var, np.min, np.max, percentile(50), percentile(95)])
Out[11]:
sum mean std median var amin amax percentile_50 percentile_95
AGGREGATE
A 106 35.333333 42.158431 12 1777.333333 10 84 12 76.8
B 36 12.000000 8.888194 9 79.000000 5 22 12 76.8
Note sure this is how it should be done though...
You can have agg()
use a custom function to be executed on specified column:
# 50th Percentile
def q50(x):
return x.quantile(0.5)
# 90th Percentile
def q90(x):
return x.quantile(0.9)
my_DataFrame.groupby(['AGGREGATE']).agg({'MY_COLUMN': [q50, q90, 'max']})
Being more specific, if you just want to aggregate your pandas groupby results using the percentile function, the python lambda function offers a pretty neat solution. Using the question's notation, aggregating by the percentile 95, should be:
dataframe.groupby('AGGREGATE').agg(lambda x: np.percentile(x['COL'], q = 95))
You can also assign this function to a variable and use it in conjunction with other aggregation functions.
试试这个 50% 和 95% 的百分位数:
column.describe(percentiles=[0.5, 0.95])
I really like the solution Andy Hayden gave , however, this had multiple issues for me:
Here is an updated version that fixes these issues:
def percentile(n):
def percentile_(x):
return x.quantile(n)
percentile_.__name__ = 'percentile_{:2.0f}'.format(n*100)
return percentile_
我相信在熊猫中这样做的惯用方法是:
df.groupby("AGGREGATE").quantile([0, 0.25, 0.5, 0.75, 0.95, 1])
For situations where all you need is a subset of the describe
(typically the most common needed statistics) you can just index the returned pandas series without needing any extra functions.
For example, I commonly find myself just needing to present the 25th, median, 75th and count. This can be done in just one line like so:
columns.agg('describe')[['25%', '50%', '75%', 'count']]
For specifying your own set of percentiles, the chosen answer is a good choice, but for simple use case, there is no need for extra functions.
df.groupby("AGGREGATE").describe(percentiles=[0, 0.25, 0.5, 0.75, 0.95, 1])
默认情况下, describe
函数为我们提供mean, count, std, min, max
和百分位数数组,您可以选择所需的百分位数。
More efficient solution with pandas.Series.quantile
method:
df.groupby("AGGREGATE").agg(("YOUR_COL_NAME", lambda x: x.quantile(0.5))
With several percentile values
percentiles = [0.5, 0.9, 0.99]
quantile_funcs = [(p, lambda x: x.quantile(p)) for p in percentiles]
df.groupby("AGGREGATE").agg(quantile_funcs)
Just to throw a more general solution into the ring. Assume you have a DF with just one column to group:
df = pd.DataFrame((('A',10),('A',12),('B',5),('B',9),('A',84),('B',22)),
columns=['My_KEY', 'MY_COL1'])
One can aggregate and calcualte basically any descriptive metric with a list of anonymous (lambda) functions like:
df.groupby(['My_KEY']).agg( [np.sum, np.mean, lambda x: np.percentile(x, q=25)] )
However, if you have multiple columns to aggregate, you have to call a non anonymous function or call the columns explicitly:
df = pd.DataFrame((('A',10,3),('A',12,4),('B',5,6),('B',9,3),('A',84,2),('B',22,1)),
columns=['My_KEY', 'MY_COL1', 'MY_COL2'])
# non-anonymous function
def percentil25 (x):
return np.percentile(x, q=25)
# type 1: call for both columns
df.groupby(['My_KEY']).agg( [np.sum, np.mean, percentil25 ] )
# type 2: call each column separately
df.groupby(['My_KEY']).agg( {'MY_COL1': [np.sum, np.mean, lambda x: np.percentile(x, q=25)],
'MY_COL2': np.size})
You can also perhaps use lambda to achieve the same. Some thing like below piece of code :
agg(
lambda x: [
np.min(a=x),
np.percentile(q=25,a=x),
np.median(a=x),
np.percentile(q=75,a=x),
np.max(a=x)
]
)
This can provide some customization:
list_statistics = ['count','mean','min',lambda x: np.percentile(x,q=25),'max',lambda x: np.percentile(x,q=75)]
cols_to_rename = {'<lambda_0>':'P25','<lambda_1>':'P75'}
df_out.groupby('Country').agg(list_statistics).rename(columns=cols_to_rename)
Multiple function can be called as below:
import pandas as pd
import numpy as np
import random
C = ['Ram', 'Ram', 'Shyam', 'Shyam', 'Mahima', 'Ram', 'Ram', 'Shyam', 'Shyam', 'Mahima']
A = [ random.randint(0,100) for i in range(10) ]
B = [ random.randint(0,100) for i in range(10) ]
df = pd.DataFrame({ 'field_A': A, 'field_B': B, 'field_C': C })
print(df)
d = df.groupby('field_C')['field_A'].describe()[['mean', 'count', '25%', '50%', '75%']]
print(d)
I was unable to call median in this, but able to work other functions.
In case you have a dataframe with several columns and only want the quantiles for one column:
df.groupby("AGGREGATE")['MY_COLUMN'].quantile([0, 0.25, 0.5, 0.75, 0.95, 1])
and in case you want a 1 level dataframe you can add:
df.groupby("AGGREGATE")['MY_COLUMN'].quantile([0, 0.25, 0.5, 0.75, 0.95, 1]).reset_index()
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.