I have a large dataset with 104 unique dates and 200k SKUs. For this explanation I am using 3 SKUs and 4 dates.
The data is as follows:
Date      SKU  Demand  Supply
20160501    1      10      10
20160508    1      35      20
20160501    2      20      15
20160508    2      15      20
20160522    2       5       0
20160522    3      55      45
The rows are populated only where there is non-zero Demand or Supply. I want to calculate cumulative Demand and Supply while keeping a continuous date range for every SKU, filling the missing dates with 0.
My desired output would look like this:
Date      SKU  Demand  Supply  Cum_Demand  Cum_Supply
20160501    1      10      10          10          10
20160508    1      35      20          45          30
20160515    1       0       0          45          30
20160522    1       0       0          45          30
20160501    2      20      15          20          15
20160508    2      15      20          35          35
20160515    2       0       0          35          35
20160522    2       5       0          40          35
20160501    3       0       0           0           0
20160508    3       0       0           0           0
20160515    3       0       0           0           0
20160522    3      55      45          55          45
Code for the dataframe:
import pandas as pd

data = pd.DataFrame({'Date': [20160501, 20160508, 20160501, 20160508, 20160522, 20160522],
                     'SKU': [1, 1, 2, 2, 2, 3],
                     'Demand': [10, 35, 20, 15, 5, 55],
                     'Supply': [10, 20, 15, 20, 0, 45]},
                    columns=['Date', 'SKU', 'Demand', 'Supply'])
You need to first reindex, then groupby + cumsum, and finally concatenate back:
import pandas as pd

idx = pd.MultiIndex.from_product([[20160501, 20160508, 20160515, 20160522],
                                  data.SKU.unique()], names=['Date', 'SKU'])
# If every date you need already appears in the Date column, you can instead use:
# pd.MultiIndex.from_product([np.unique(data.Date), data.SKU.unique()], names=['Date', 'SKU'])
data2 = data.set_index(['Date', 'SKU']).reindex(idx).fillna(0)
data2 = (pd.concat([data2, data2.groupby(level=1).cumsum().add_prefix('Cum_')], axis=1)
           .sort_index(level=1)
           .reset_index())
data2
        Date  SKU  Demand  Supply  Cum_Demand  Cum_Supply
0   20160501    1    10.0    10.0        10.0        10.0
1   20160508    1    35.0    20.0        45.0        30.0
2   20160515    1     0.0     0.0        45.0        30.0
3   20160522    1     0.0     0.0        45.0        30.0
4   20160501    2    20.0    15.0        20.0        15.0
5   20160508    2    15.0    20.0        35.0        35.0
6   20160515    2     0.0     0.0        35.0        35.0
7   20160522    2     5.0     0.0        40.0        35.0
8   20160501    3     0.0     0.0         0.0         0.0
9   20160508    3     0.0     0.0         0.0         0.0
10  20160515    3     0.0     0.0         0.0         0.0
11  20160522    3    55.0    45.0        55.0        45.0
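Note that reindex followed by fillna upcasts the numeric columns to float. If you want integer columns as in the desired output, you can cast them back; a minimal optional sketch:
num_cols = ['Demand', 'Supply', 'Cum_Demand', 'Cum_Supply']
data2[num_cols] = data2[num_cols].astype(int)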
You will need to be careful about your dates. In this case I explicitly listed the order so earlier dates appeared first. If they are numbers, you can use np.unique, which will sort the values and ensure the dates are ordered. But this relies on every date appearing in your DataFrame at least once; otherwise you will need to build your list of ordered dates some other way.
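If not every week is represented in the data (quite possible with 104 dates and 200k SKUs), one option is to generate the ordered list from the minimum and maximum dates. A minimal sketch, assuming the dates are weekly and stored as YYYYMMDD integers as in the question:
start = pd.to_datetime(str(data.Date.min()), format='%Y%m%d')
end = pd.to_datetime(str(data.Date.max()), format='%Y%m%d')
all_dates = [int(d.strftime('%Y%m%d')) for d in pd.date_range(start, end, freq='W')]
# all_dates -> [20160501, 20160508, 20160515, 20160522] for the sample data
idx = pd.MultiIndex.from_product([all_dates, data.SKU.unique()], names=['Date', 'SKU'])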
An alternative approach is to work with actual datetimes. Start by converting the Date column to datetime format:
df = data.copy()   # working copy of the question's DataFrame
df.Date = pd.to_datetime(df.Date, format='%Y%m%d')
You can create a weekly pd.date_range using the existing dates:
ix = pd.date_range(df.Date.min(), df.Date.max() + pd.DateOffset(1), freq="W")
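For the sample data this should yield the four weekly (Sunday) dates:
# ix -> DatetimeIndex(['2016-05-01', '2016-05-08', '2016-05-15', '2016-05-22'], freq='W-SUN')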
The following step is to groupby SKU, reindex according to the created date range, and choose a fill method per column: ffill and bfill to fill the NaNs in the SKU column, and 0 for Demand and Supply.
df1 = (df.set_index('Date').groupby('SKU').apply(lambda x: x.reindex(ix)[['SKU']])
.ffill().bfill().reset_index(0, drop=True))
df2 = (df.set_index('Date').groupby('SKU').apply(lambda x: x.reindex(ix)[['Demand','Supply']])
.fillna(0).reset_index(0, drop=True))
The final step is to concatenate the two dataframes and take the cumsum of Demand and Supply:
df_final = pd.concat([df1, df2], axis=1)
(df_final.assign(**df_final.groupby('SKU')
.agg({'Demand':'cumsum','Supply':'cumsum'})
.add_prefix('cum_')))
            SKU  Demand  Supply  cum_Demand  cum_Supply
2016-05-01  1.0    10.0    10.0        10.0        10.0
2016-05-08  1.0    35.0    20.0        45.0        30.0
2016-05-15  1.0     0.0     0.0        45.0        30.0
2016-05-22  1.0     0.0     0.0        45.0        30.0
2016-05-01  2.0    20.0    15.0        20.0        15.0
2016-05-08  2.0    15.0    20.0        35.0        35.0
2016-05-15  2.0     0.0     0.0        35.0        35.0
2016-05-22  2.0     5.0     0.0        40.0        35.0
2016-05-01  3.0     0.0     0.0         0.0         0.0
2016-05-08  3.0     0.0     0.0         0.0         0.0
2016-05-15  3.0     0.0     0.0         0.0         0.0
2016-05-22  3.0    55.0    45.0        55.0        45.0
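Reindexing also leaves SKU (and the other columns) as floats and keeps the dates in the index. If you want the integer dtypes and an explicit Date column as in the desired output, a small follow-up sketch (variable names here are just illustrative):
cum = df_final.groupby('SKU')[['Demand', 'Supply']].cumsum().add_prefix('cum_')
out = (df_final.assign(**cum)        # attach cum_Demand / cum_Supply
               .astype(int)          # all values are whole numbers, so the cast is safe
               .rename_axis('Date')  # name the DatetimeIndex
               .reset_index())       # expose Date as a regular column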