
Python: Cumulative Sum on Multiple IDs with missing row

I have a large dataset with 104 unique dates and 200k SKUs. For this explanation I am using 3 SKUs and 4 dates.

The data is as follows

 Date      SKU        Demand      Supply
 20160501   1            10          10
 20160508   1            35          20
 20160501   2            20          15
 20160508   2            15          20
 20160522   2            5           0
 20160522   3            55          45

Rows exist only where there is a non-zero Demand or Supply. I want to calculate cumulative Demand and Supply while keeping a continuous date range for every SKU, filling the missing dates with 0.

My desired output is as follows

Date       SKU        Demand      Supply    Cum_Demand    Cum_Supply
20160501     1         10         10         10            10
20160508     1         35         20         45            30
20160515     1         0          0          45            30
20160522     1         0          0          45            30
20160501     2         20         15         20            15
20160508     2         15         20         35            35
20160515     2         0          0          35            35
20160522     2         5          0          40            35
20160501     3         0          0          0             0
20160508     3         0          0          0             0
20160515     3         0          0          0             0
20160522     3         55         45         55            45

Code for the dataframe

data = pd.DataFrame({'Date':[20160501,20160508,20160501,20160508,20160522,20160522],
                 'SKU':[1,1,2,2,2,3],
                 'Demand':[10,35,20,15,5,55],
                 'Supply':[10,20,15,20,0,45]}
                ,columns=['Date', 'SKU', 'Demand', 'Supply'])

First reindex against the full Date × SKU grid, then groupby + cumsum, and concatenate the results back:

import pandas as pd

idx = pd.MultiIndex.from_product([[20160501,20160508,20160515,20160522], 
                                  data.SKU.unique()], names=['Date', 'SKU'])
#If have all unique dates needed in column then: 
#pd.MultiIndex.from_product([np.unique(data.Date), data.SKU.unique()])

data2 = data.set_index(['Date', 'SKU']).reindex(idx).fillna(0)
data2 = (pd.concat([data2, data2.groupby(level=1).cumsum().add_prefix('Cum_')],
                   axis=1)
           .sort_index(level=1)
           .reset_index())

Output data2 :

        Date  SKU  Demand  Supply  Cum_Demand  Cum_Supply
0   20160501    1    10.0    10.0        10.0        10.0
1   20160508    1    35.0    20.0        45.0        30.0
2   20160515    1     0.0     0.0        45.0        30.0
3   20160522    1     0.0     0.0        45.0        30.0
4   20160501    2    20.0    15.0        20.0        15.0
5   20160508    2    15.0    20.0        35.0        35.0
6   20160515    2     0.0     0.0        35.0        35.0
7   20160522    2     5.0     0.0        40.0        35.0
8   20160501    3     0.0     0.0         0.0         0.0
9   20160508    3     0.0     0.0         0.0         0.0
10  20160515    3     0.0     0.0         0.0         0.0
11  20160522    3    55.0    45.0        55.0        45.0
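Note that reindex introduces NaN, so after fillna(0) the numeric columns come back as float (hence the 10.0 values above). A sketch of the same pipeline with integer dtypes restored via astype(int), in case you want output matching the question exactly:

```python
import pandas as pd

data = pd.DataFrame({'Date': [20160501, 20160508, 20160501, 20160508, 20160522, 20160522],
                     'SKU': [1, 1, 2, 2, 2, 3],
                     'Demand': [10, 35, 20, 15, 5, 55],
                     'Supply': [10, 20, 15, 20, 0, 45]})

idx = pd.MultiIndex.from_product([[20160501, 20160508, 20160515, 20160522],
                                  data.SKU.unique()], names=['Date', 'SKU'])

# astype(int) right after fillna(0) restores integer columns
data2 = data.set_index(['Date', 'SKU']).reindex(idx).fillna(0).astype(int)
data2 = (pd.concat([data2, data2.groupby(level=1).cumsum().add_prefix('Cum_')],
                   axis=1)
           .sort_index(level=1)
           .reset_index())
```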

You will need to be careful about your dates. In this case I explicitly listed the order so earlier dates appeared first. If they are numbers, then you can use np.unique which will sort the values, ensuring dates are ordered. But this relies upon every date appearing in your DataFrame at least once. Otherwise you will need to create your list of ordered dates somehow.
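If you don't want to list the dates by hand, one way to generate the ordered list programmatically (a sketch, assuming integer yyyymmdd dates on a weekly cadence; `freq='W'` anchors on Sundays, which happens to match these dates) is:

```python
import pandas as pd

# Observed dates, possibly with gaps
dates = pd.to_datetime(pd.Series([20160501, 20160508, 20160522]), format='%Y%m%d')

# Build the full weekly range between the first and last observed date,
# then convert back to the yyyymmdd integer format used by the DataFrame
full_range = pd.date_range(dates.min(), dates.max(), freq='W')
ordered = full_range.strftime('%Y%m%d').astype(int).tolist()
print(ordered)  # [20160501, 20160508, 20160515, 20160522]
```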

Start by converting the date to datetime format (using df for the question's dataframe):

df.Date = pd.to_datetime(df.Date, format='%Y%m%d')

You can create a weekly pd.date_range using the existing dates:

ix = pd.date_range(df.Date.min(), df.Date.max() + pd.DateOffset(1), freq="W")

The next step is to group by SKU, reindex each group against the created date range, and choose a filling method per column: ffill and bfill to fill the NaNs in the case of SKU, and 0 for Demand and Supply.

df1 = (df.set_index('Date').groupby('SKU').apply(lambda x: x.reindex(ix)[['SKU']])
                          .ffill().bfill().reset_index(0, drop=True))
df2 = (df.set_index('Date').groupby('SKU').apply(lambda x: x.reindex(ix)[['Demand','Supply']])
                          .fillna(0).reset_index(0, drop=True))

The final step is to concatenate the two dataframes, and take the cumsum of Demand and Supply :

df_final = pd.concat([df2,df1],axis=1)

(df_final.assign(**df_final.groupby('SKU')[['Demand', 'Supply']]
                           .cumsum()
                           .add_prefix('cum_')))

            SKU   Demand  Supply    cum_Demand  cum_Supply
2016-05-01  1.0    10.0    10.0        10.0        10.0
2016-05-08  1.0    35.0    20.0        45.0        30.0
2016-05-15  1.0     0.0     0.0        45.0        30.0
2016-05-22  1.0     0.0     0.0        45.0        30.0
2016-05-01  2.0    20.0    15.0        20.0        15.0
2016-05-08  2.0    15.0    20.0        35.0        35.0
2016-05-15  2.0     0.0     0.0        35.0        35.0
2016-05-22  2.0     5.0     0.0        40.0        35.0
2016-05-01  3.0     0.0     0.0         0.0         0.0
2016-05-08  3.0     0.0     0.0         0.0         0.0
2016-05-15  3.0     0.0     0.0         0.0         0.0
2016-05-22  3.0    55.0    45.0        55.0        45.0
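If you want the result back in the question's integer-date format, a sketch (the small `df_final` here just stands in for the full output above, with its DatetimeIndex and float columns):

```python
import pandas as pd

# Stand-in for df_final: DatetimeIndex and float columns left over from reindex/fillna
df_final = pd.DataFrame({'SKU': [1.0, 1.0], 'Demand': [10.0, 35.0]},
                        index=pd.to_datetime(['2016-05-01', '2016-05-08']))

# Move the index into a Date column and convert back to yyyymmdd integers
out = df_final.rename_axis('Date').reset_index()
out['Date'] = out['Date'].dt.strftime('%Y%m%d').astype(int)
out['SKU'] = out['SKU'].astype(int)
```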
