
How to groupby multiple columns in pandas DataFrame in pct_change calculation

I am applying a pct_change calculation to a pandas DataFrame. Everything works fine when the month column is in order. When it is not, the calculation comes out wrong.

Here is my code now:

from datetime import date, datetime
from pandas import DataFrame

data = [
('product_a','1/31/2014',53)
,('product_b','1/31/2014',44)
,('product_c','1/31/2014',36)
,('product_a','11/30/2013',52)
,('product_b','11/30/2013',43)
,('product_c','11/30/2013',35)
,('product_a','3/31/2014',50)
,('product_b','3/31/2014',41)
,('product_c','3/31/2014',34)
,('product_a','12/31/2013',50)
,('product_b','12/31/2013',41)
,('product_c','12/31/2013',34)
,('product_a','2/28/2014',52)
,('product_b','2/28/2014',43)
,('product_c','2/28/2014',35)
]

product_df = DataFrame( data, columns=['prod_desc','activity_month','prod_count'] )

for index, row in product_df.iterrows():
  row['activity_month']= datetime.strptime(row['activity_month'],'%m/%d/%Y')
  product_df.loc[index, 'activity_month'] = date.strftime(row['activity_month'],'%Y-%m-%d')

product_df['pct_ch'] = product_df.groupby('prod_desc')['prod_count'].pct_change()

product_df = product_df.sort_values(['prod_desc','activity_month'])

What I get returned:

   prod_desc activity_month  prod_count    pct_ch
3      product_a     2013-11-30         52 -0.018868
9      product_a     2013-12-31         50  0.000000
0      product_a     2014-01-31         53       NaN
12     product_a     2014-02-28         52  0.040000
6      product_a     2014-03-31         50 -0.038462
4      product_b     2013-11-30         43 -0.022727
10     product_b     2013-12-31         41  0.000000
1      product_b     2014-01-31         44       NaN
13     product_b     2014-02-28         43  0.048780
7      product_b     2014-03-31         41 -0.046512
5      product_c     2013-11-30         35 -0.027778
11     product_c     2013-12-31         34  0.000000
2      product_c     2014-01-31         36       NaN
14     product_c     2014-02-28         35  0.029412
8      product_c     2014-03-31         34 -0.028571

The calculations are out of order: the pct_change for the first month of each product should be NaN.

I think the issue is that the pct_change calculation does not include 'activity_month' in the groupby. When I add it, I get the following output:

product_df['pct_ch'] = product_df.groupby(['prod_desc','activity_month'])['prod_count'].pct_change() 

   prod_desc activity_month  prod_count  pct_ch
3      product_a     2013-11-30         52     NaN
9      product_a     2013-12-31         50     NaN
0      product_a     2014-01-31         53     NaN
12     product_a     2014-02-28         52     NaN
6      product_a     2014-03-31         50     NaN
4      product_b     2013-11-30         43     NaN
10     product_b     2013-12-31         41     NaN
1      product_b     2014-01-31         44     NaN
13     product_b     2014-02-28         43     NaN
7      product_b     2014-03-31         41     NaN
5      product_c     2013-11-30         35     NaN
11     product_c     2013-12-31         34     NaN
2      product_c     2014-01-31         36     NaN
14     product_c     2014-02-28         35     NaN
8      product_c     2014-03-31         34     NaN
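
A likely reason for the all-NaN column, sketched on the assumption that each (prod_desc, activity_month) pair occurs only once, as it does in the sample data: grouping on both columns puts every row in its own group, so pct_change never has a previous row to compare against. The small frame and the variable names below are illustrative, not the original data.

```python
import pandas as pd

# Illustrative frame (not the original data): one row per product and month.
df = pd.DataFrame({
    'prod_desc': ['product_a', 'product_a', 'product_b', 'product_b'],
    'activity_month': ['2013-11-30', '2013-12-31', '2013-11-30', '2013-12-31'],
    'prod_count': [52, 50, 43, 41],
})

# Grouping on both keys yields groups of size 1 ...
group_sizes = df.groupby(['prod_desc', 'activity_month']).size()

# ... so every row is the first (and only) row of its group and gets NaN.
pct_two_keys = df.groupby(['prod_desc', 'activity_month'])['prod_count'].pct_change()

# Grouping on the product alone compares each row with the previous row
# of the same product.
pct_one_key = df.groupby('prod_desc')['prod_count'].pct_change()
```

Here `group_sizes` is 1 for every pair, `pct_two_keys` is entirely NaN, and `pct_one_key` is NaN only for the first row of each product.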

I think the issue is that the groupby computes the percentage change between adjacent rows with the same prod_desc, and those rows are not in date order when you perform the operation. Moving the sort above the groupby fixes that. You can also drop the for loop and do the date conversion in a single pandas call.

import pandas as pd 

data = [
('product_a','1/31/2014',53)
,('product_b','1/31/2014',44)
,('product_c','1/31/2014',36)
,('product_a','11/30/2013',52)
,('product_b','11/30/2013',43)
,('product_c','11/30/2013',35)
,('product_a','3/31/2014',50)
,('product_b','3/31/2014',41)
,('product_c','3/31/2014',34)
,('product_a','12/31/2013',50)
,('product_b','12/31/2013',41)
,('product_c','12/31/2013',34)
,('product_a','2/28/2014',52)
,('product_b','2/28/2014',43)
,('product_c','2/28/2014',35)
]

product_df = pd.DataFrame( data, columns=['prod_desc','activity_month','prod_count'])

product_df['activity_month'] = pd.to_datetime(product_df['activity_month'],
 format='%m/%d/%Y')

product_df = product_df.sort_values(['prod_desc','activity_month'])
product_df['pct_ch'] = product_df.groupby('prod_desc')['prod_count'].pct_change()

I think this should produce the answer you want.

    prod_desc activity_month  prod_count    pct_ch
3   product_a     2013-11-30          52       NaN
9   product_a     2013-12-31          50 -0.038462
0   product_a     2014-01-31          53  0.060000
12  product_a     2014-02-28          52 -0.018868
6   product_a     2014-03-31          50 -0.038462
4   product_b     2013-11-30          43       NaN
10  product_b     2013-12-31          41 -0.046512
1   product_b     2014-01-31          44  0.073171
13  product_b     2014-02-28          43 -0.022727
7   product_b     2014-03-31          41 -0.046512
5   product_c     2013-11-30          35       NaN
11  product_c     2013-12-31          34 -0.028571
2   product_c     2014-01-31          36  0.058824
14  product_c     2014-02-28          35 -0.027778
8   product_c     2014-03-31          34 -0.028571
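
If the original row order needs to be kept, one possible variant (a sketch, not part of the code above) is to compute the change on a sorted copy and assign the result back. pandas aligns the assignment on the index labels, so each value should land on its original row. The shortened data and the name `pct_sorted` are illustrative.

```python
import pandas as pd

# Shortened, illustrative version of the data, deliberately out of date order.
df = pd.DataFrame(
    [('product_a', '1/31/2014', 53),
     ('product_b', '1/31/2014', 44),
     ('product_a', '11/30/2013', 52),
     ('product_b', '11/30/2013', 43),
     ('product_a', '12/31/2013', 50),
     ('product_b', '12/31/2013', 41)],
    columns=['prod_desc', 'activity_month', 'prod_count'],
)
df['activity_month'] = pd.to_datetime(df['activity_month'], format='%m/%d/%Y')

# Compute pct_change on a date-sorted copy; the result keeps the
# original index labels of the rows it was computed from.
pct_sorted = (
    df.sort_values(['prod_desc', 'activity_month'])
      .groupby('prod_desc')['prod_count']
      .pct_change()
)

# Assignment aligns on the index, so df itself stays in its original order.
df['pct_ch'] = pct_sorted
```

With this approach the earliest month of each product gets NaN even though it is not the first row in the frame.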
