简体   繁体   中英

Return the the product with highest sale for each quarter from transaction data

I have a data frame as follows.

df_sample=pd.DataFrame({'ID':['ID1','ID2','ID2','ID2','ID1','ID2','ID1','ID1'],
         "quarter":['2016Q1','2016Q1','2016Q1','2017Q1','2017Q1','2018Q1','2018Q2','2018Q3'],
         "product":['productA','productB','productA','productD','productA','productA','productD','ProductA'],
         "sales":[100,200,100,400,100,500,400,100]})

I want to get the top product based on the cumulative sales amount for each ID. ie for ID1 for the 2018Q1 quarter, I want to take the sum of each product sold for all data <=2018Q1 and return the product name for each ID. Thanks in advance.

Expected output:

pd.DataFrame({'ID':['ID1','ID1','ID1','ID1',   'ID2','ID2','ID2'],
             "quarter":['2016Q1','2017Q1','2018Q2','2018Q3','2016Q1','2017Q1','2018Q1'],
             "product":['productA','productA','productD','productD','productB','ProductD','productA']})

IIUC, you can use a double groupby :

(df_sample
 .groupby(['ID', 'quarter', 'product'])['sales'].sum()
 .unstack('product', fill_value=0)
 .groupby('ID').cumsum()
 .idxmax(1)
)

output:

ID   quarter
ID1  2016Q1     productA
     2017Q1     productA
     2018Q2     productD
     2018Q3     productD
ID2  2016Q1     productB
     2017Q1     productD
     2018Q1     productA
dtype: object

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM