High performance apply on group by pandas

Question

I need to calculate a percentile on a column of a pandas DataFrame. A subset of the DataFrame is as below:

I want to calculate the 20th percentile of SaleQTY for each group of ["Barcode","ShopCode"], so I defined the function below:

def quant(group):
    group["Quantile"] = np.quantile(group["SaleQTY"], 0.2)
    return group

I then apply this function to each group of my sales data, which has almost 18 million rows and roughly 3 million groups of ["Barcode","ShopCode"]:

quant_sale = sales.groupby(['Barcode','ShopCode']).apply(quant)

That took 2 hours to complete on a Windows server with 128 GB of RAM and 32 cores. That makes no sense, because this is only one small part of my code. So I started searching the net for ways to improve the performance. I came up with a "numba" solution, shown below, which didn't work:

from numba import njit, jit
@jit(nopython=True)
def quant_numba(df):
    final_quant = []
    for bar_shop,group in df.groupby(['Barcode','ShopCode']):
        group["Quantile"] = np.quantile(group["SaleQTY"], 0.2)
        final_quant.append((bar_shop,group["Quantile"]))
    return final_quant    
result = quant_numba(sales)

It seems that I cannot use pandas objects within this decorator.

I am not sure whether I can use multiprocessing (a concept I'm unfamiliar with) or whether there is any other way to speed up my code. Any help would be appreciated.

Answer 1

You can try DataFrameGroupBy.quantile:

df1 = df.groupby(['Barcode', 'ShopCode'])['SaleQTY'].quantile(0.2)

Or, as @Jon Clements mentioned, to fill a new column with the percentiles, use GroupBy.transform:

df['Quantile'] = df.groupby(['Barcode', 'ShopCode'])['SaleQTY'].transform('quantile', q=0.2)
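
As a minimal sketch of how the two calls differ in shape, here they are on a tiny made-up frame. The sample values are invented for illustration; only the column names come from the question. The last line cross-checks against np.quantile, which the question's function uses:

```python
import numpy as np
import pandas as pd

# Invented sample data; only the column names come from the question.
sales = pd.DataFrame({
    "Barcode":  [1, 1, 1, 2, 2, 2],
    "ShopCode": [10, 10, 10, 20, 20, 20],
    "SaleQTY":  [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

keys = ["Barcode", "ShopCode"]

# One value per group: a Series indexed by (Barcode, ShopCode).
per_group = sales.groupby(keys)["SaleQTY"].quantile(0.2)

# One value per original row, aligned with the index of `sales`,
# so it can be assigned directly as a new column.
sales["Quantile"] = sales.groupby(keys)["SaleQTY"].transform("quantile", q=0.2)

# Cross-check for the first group against np.quantile.
first_group_expected = np.quantile([1.0, 2.0, 3.0], 0.2)
```

Here per_group has two entries (one per group), while sales["Quantile"] has six (one per row). Both avoid calling a Python function once per group, which is likely where most of the time goes with apply on roughly 3 million groups.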

Answer 2

There is a built-in function in pandas called quantile().

quantile(q) returns the value at quantile q of a column in a DataFrame, where q is a fraction between 0 and 1 (for example, q=0.2 for the 20th percentile).
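
A minimal sketch of that call, using invented values in a column named as in the question. Note that, used like this, it computes a single quantile over the whole column; getting one value per ["Barcode","ShopCode"] group still requires combining it with groupby:

```python
import pandas as pd

# Invented sample data; only the column name comes from the question.
df = pd.DataFrame({"SaleQTY": [1.0, 2.0, 3.0, 4.0, 5.0]})

# 20th percentile of the whole column (q is a fraction between 0 and 1).
p20 = df["SaleQTY"].quantile(0.2)

# Passing a list of fractions returns a Series indexed by those fractions.
several = df["SaleQTY"].quantile([0.2, 0.5])
```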

Doc reference link

GeeksforGeeks example reference

solution1
3 ACCEPTED 2020-02-09 08:03:17

solution2
1 2020-02-09 08:08:49
