I have the following dataframe of customer sales history (it's just part of it, the actual dataframe is more than 70k rows):
import pandas as pd
import datetime as DT

df_test = pd.DataFrame({
    'Cus_ID': ["T313","T348","T313","T348","T313","T348","T329","T329","T348","T313","T329","T348"],
    'Value': [3,2,3,4,5,3,7.25,10.25,4.5,11.75,6.25,6],
    'Date' : [
        DT.datetime(2015,10,18),
        DT.datetime(2015,11,14),
        DT.datetime(2015,11,18),
        DT.datetime(2015,12,13),
        DT.datetime(2015,12,19),
        DT.datetime(2016,1,24),
        DT.datetime(2016,1,31),
        DT.datetime(2016,2,17),
        DT.datetime(2016,3,28),
        DT.datetime(2016,3,31),
        DT.datetime(2016,4,3),
        DT.datetime(2016,4,16),
    ]})
I would like to add a new column to the dataframe showing the time-weighted average of the last 90 days for each customer.
Expected result (column Value_Result):
    Cus_ID  Date        Value  Value_Result
0   T313    2015-10-18   3.00  NaN     (no 90-day history)
1   T348    2015-11-14   2.00  NaN     (no 90-day history)
2   T313    2015-11-18   3.00  3       (3*31)/31
3   T348    2015-12-13   4.00  2       (2*29)/29
4   T313    2015-12-19   5.00  3       (3*62+3*31)/(62+31)
5   T348    2016-01-24   3.00  2.743   (4*42+2*71)/(42+71)
6   T329    2016-01-31   7.25  NaN     (no 90-day history)
7   T329    2016-02-17  10.25  7.25    (7.25*17)/17
8   T348    2016-03-28   4.50  3       (3*64)/64
9   T313    2016-03-31  11.75  NaN     (no 90-day history)
10  T329    2016-04-03   6.25  8.516   (10.25*46+7.25*63)/(46+63)
11  T348    2016-04-16   6.00  3.279   (4.5*19+3*83)/(19+83)
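To make the arithmetic in the last column concrete, here is a minimal sketch that reproduces row 5 (T348 on 2016-01-24) on its own. It assumes, going by the expected results, that each earlier value is weighted by its age in days and that only rows strictly before the current date and at most 90 days old count. The names history, current_date, age_days and in_window are just illustrative.

```python
import pandas as pd

# Earlier T348 rows from the sample data (dates and values copied from the question).
history = pd.DataFrame({
    "Date": pd.to_datetime(["2015-11-14", "2015-12-13"]),
    "Value": [2.0, 4.0],
})
current_date = pd.Timestamp("2016-01-24")

# Age of each earlier row in whole days relative to the current row.
age_days = (current_date - history["Date"]).dt.days

# Assumed window: strictly before current_date and at most 90 days old.
in_window = (age_days > 0) & (age_days <= 90)

# Weight each value by its age in days, as in the expected-result column.
weights = age_days[in_window]
values = history.loc[in_window, "Value"]
result = (values * weights).sum() / weights.sum()
print(round(result, 3))  # 2.743, i.e. (4*42 + 2*71) / (42 + 71)
```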
I've tried to use groupby('Cus_ID') and a rolling apply, but I'm having difficulty writing a function that only considers the previous 90 days.
Any input is highly appreciated.
I'm not sure the rolling function is the way to go for a weighted average, although maybe someone else knows how to use it for that. I can't promise this is the most optimized method, but it will yield the result you want, and you can build upon it if necessary.
Big thanks to this pbpython article. I recommend reading through it.
My approach is to create a function that is applied to each group (grouped by Cus_ID). The function iterates over the rows in that group, computes the weighted average as you describe above, writes it back to the group, and returns the group. This code snippet is verbose for clarity of explanation; you can trim it down by removing the intermediate variables if desired.
The apply function looks like this:
import numpy as np
import pandas as pd

def tw_avg(group, value_col, time_col, new_col_name="time_weighted_average", days_back='-90 days', fill_value=np.nan):
    """
    Calculate the time-weighted (by day) average of the group passed.
    It does not include the row being evaluated, only the rows in the preceding days_back window.
    Should be used with the apply() function in pandas with the groupby function.
    Args:
        group (pandas.DataFrame): Will be passed by pandas
        value_col (str): Name of column with the values to be averaged by weight
        time_col (str): Name of column with the times
        new_col_name (str): Name of new column to place the time-weighted average into, default: time_weighted_average
        days_back (str): Time delta description as described in the pandas time deltas documentation, default: -90 days
        fill_value (any): The value to fill rows which do not have data in the days_back period, default: np.nan
    Returns:
        (pandas.DataFrame): A copy of the group with the time-weighted average added as a column,
            fill_value where no time-weighted average exists
    """
    results = []
    for _, row in group.iterrows():
        # Keep only rows strictly before this row and within the days_back window.
        days_back_fil = (group[time_col] < row[time_col]) & (group[time_col] >= row[time_col] + pd.Timedelta(days_back))
        df = group[days_back_fil]
        # Divide by a one-day timedelta to get the number of days as a float.
        days = (row[time_col] - df[time_col]) / np.timedelta64(1, 'D')
        total_days = days.sum()
        # An empty window sums to 0 days, so use fill_value instead of dividing by zero.
        if total_days > 0:
            results.append((df[value_col] * days).sum() / total_days)
        else:
            results.append(fill_value)
    group = group.copy()  # avoid modifying the frame pandas passes in
    group[new_col_name] = results
    return group
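You can then get the DataFrame you're looking for with this line (depending on your pandas version, you may need to pass group_keys=False to groupby to keep the original flat index):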
df_test.groupby(by=['Cus_ID']).apply(tw_avg, 'Value', 'Date')
This will yield:
    Cus_ID  Date        Value  time_weighted_average
0   T313    2015-10-18  3.0    NaN
1   T348    2015-11-14  2.0    NaN
2   T313    2015-11-18  3.0    3.0
3   T348    2015-12-13  4.0    2.0
4   T313    2015-12-19  5.0    3.0
5   T348    2016-01-24  3.0    2.743362831858407
6   T329    2016-01-31  7.25   NaN
7   T329    2016-02-17  10.25  7.25
8   T348    2016-03-28  4.5    3.0
9   T313    2016-03-31  11.75  NaN
10  T329    2016-04-03  6.25   8.51605504587156
11  T348    2016-04-16  6.0    3.2794117647058822
You can now use that function to apply the weighted average to other value columns with the value_col argument, or change the time window length with the days_back argument. See the pandas time deltas page for how to describe time deltas.
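Since the question mentions more than 70k rows, the row-by-row loop may become slow. As a possible starting point, here is a rough sketch of the same weighting done with NumPy broadcasting inside each customer group. It is an alternative, not the function above: tw_avg_vectorized is a hypothetical name, days_back is a positive number of days here (not a time delta string), and it builds an n-by-n matrix per customer, so it assumes no single customer has a very large number of rows.

```python
import numpy as np
import pandas as pd

def tw_avg_vectorized(group, value_col, time_col, days_back=90):
    """Hypothetical helper: pairwise day-weighted average for one customer group."""
    times = group[time_col].to_numpy()
    values = group[value_col].to_numpy(dtype=float)
    # age[i, j] = days from row j to row i (positive when row j is earlier).
    age = (times[:, None] - times[None, :]) / np.timedelta64(1, "D")
    # Pairs outside the window get weight 0 so they drop out of both sums.
    weights = np.where((age > 0) & (age <= days_back), age, 0.0)
    total = weights.sum(axis=1)
    weighted = weights @ values
    # Rows with no history in the window stay NaN.
    out = np.full(len(group), np.nan)
    has_history = total > 0
    out[has_history] = weighted[has_history] / total[has_history]
    return pd.Series(out, index=group.index)

# Sample data from the question.
df_test = pd.DataFrame({
    "Cus_ID": ["T313", "T348", "T313", "T348", "T313", "T348",
               "T329", "T329", "T348", "T313", "T329", "T348"],
    "Value": [3, 2, 3, 4, 5, 3, 7.25, 10.25, 4.5, 11.75, 6.25, 6],
    "Date": pd.to_datetime([
        "2015-10-18", "2015-11-14", "2015-11-18", "2015-12-13",
        "2015-12-19", "2016-01-24", "2016-01-31", "2016-02-17",
        "2016-03-28", "2016-03-31", "2016-04-03", "2016-04-16",
    ]),
})

# Compute each group separately; assignment aligns the pieces on the original index.
parts = [tw_avg_vectorized(g, "Value", "Date") for _, g in df_test.groupby("Cus_ID")]
df_test["Value_Result"] = pd.concat(parts)
print(df_test)
```

Looping over the groups explicitly, instead of calling apply, is a deliberate choice in this sketch: it avoids relying on how a particular pandas version shapes the index of the apply result.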