简体   繁体   中英

aggregate data by quarter

I have a pivot pandas data frame (sales by region) that got created from another pandas data frame (sales by store) using the pivot_table method.

As an example:

df = pd.DataFrame(
    {'store':['A','B','C','D','E']*7, 
     'region':['NW','NW','SW','NE','NE']*7, 
     'date':['2017-03-30']*5+['2017-04-05']*5+['2017-04-07']*5+['2017-04-12']*5+['2017-04-13']*5+['2017-04-17']*5+['2017-04-20']*5,
     'sales':[30,1,133,9,1,30,3,135,9,11,30,1,140,15,15,25,10,137,9,3,29,10,137,9,11,30,19,145,20,10,30,8,141,25,25]
     })
df['date'] = pd.to_datetime(df['date'])
df_sales = df.pivot_table(index = ['region'], columns = ['date'], aggfunc = [np.sum], margins = True)
df_sales = df_sales.ix[:,range(0, df_sales.shape[1]-1)]

My goal is to do the following to the sales data frame, df_sales.

Create a new dataframe that summarizes sales by quarter. I could use the original dataframe df, or the sales_df.

As of quarter here we only have only two quarters ( USA fiscal calendar year ) so the quarterly aggregated data frame would look like:

2017Q1  2017Q2
10      27
31      37.5
133     139.17

I take the average for all days in Q1, and same for Q2. Thus, for example for the North east region, 'NE' , the Q1 is the average of only one day 2017-03-30, ie, 10, and for the Q2 is the average across 2017-04-05 to 2017-04-20, ie,

(20+30+12+20+30+50)/6=27

Any suggestions?

ADDITIONAL NOTE: I would ideally do the quarter aggregations on the df_sales pivoted table since it's a much smaller dataframe to keep in memory. The current solution does it on the original df, but I am still seeking a way to do it in the df_sales dataframe.

UPDATE:

Setup:

df.date = pd.to_datetime(df.date)

df_sales = df.pivot_table(index='region', columns='date', values='sales', aggfunc='sum')

In [318]: df_sales
Out[318]:
date    2017-03-30  2017-04-05  2017-04-07  2017-04-12  2017-04-13  2017-04-17  2017-04-20
region
NE              10          20          30          12          20          30          50
NW              31          33          31          35          39          49          38
SW             133         135         140         137         137         145         141

Solution:

In [319]: (df_sales.groupby(pd.PeriodIndex(df_sales.columns, freq='Q'), axis=1)
     ...:          .apply(lambda x: x.sum(axis=1)/x.shape[1])
     ...: )
Out[319]:
date    2017Q1      2017Q2
region
NE        10.0   27.000000
NW        31.0   37.500000
SW       133.0  139.166667

Solution based on the original DF:

In [253]: (df.groupby(['region', pd.PeriodIndex(df.date, freq='Q-DEC')])
     ...:    .apply(lambda x: x['sales'].sum()/x['date'].nunique())
     ...:    .to_frame('avg').unstack('date')
     ...: )
     ...:
Out[253]:
          avg
date   2017Q1      2017Q2
region
NE       10.0   27.000000
NW       31.0   37.500000
SW      133.0  139.166667

NOTE: df - is the original DF (before "pivoting")

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM