[英]Weighted Average calculation based on conditions in pandas/python
I have 2 dataframes
for Promotion Data
and Sales Data
as below.我有 2 个
Promotion Data
和Sales Data
的数据dataframes
,如下所示。
+-----------+----------+
| Promotion | Date |
+-----------+----------+
| A | 5-Jan-21 |
+-----------+----------+
| B | 8-Jan-21 |
+-----------+----------+
| C | 8-Jan-21 |
+-----------+----------+
df_prom = pd.DataFrame({
'Promotion':['A','B','C'],
'Date':['5-Jan-21','8-Jan-21','8-Jan-21'] })
+-----------+-------+
| Date | Sales |
+-----------+-------+
| 1-Jan-21 | 1,140 |
+-----------+-------+
| 2-Jan-21 | 3,046 |
+-----------+-------+
| 3-Jan-21 | 2,981 |
+-----------+-------+
| 4-Jan-21 | 2,262 |
+-----------+-------+
| 5-Jan-21 | 3,266 |
+-----------+-------+
| 6-Jan-21 | 3,231 |
+-----------+-------+
| 7-Jan-21 | 2,979 |
+-----------+-------+
| 8-Jan-21 | 1,687 |
+-----------+-------+
| 9-Jan-21 | 2,728 |
+-----------+-------+
| 10-Jan-21 | 1,136 |
+-----------+-------+
| 11-Jan-21 | 3,159 |
+-----------+-------+
| 12-Jan-21 | 1,799 |
+-----------+-------+
df_sales = pd.DataFrame({
'Date':['1-Jan-21', '2-Jan-21', '3-Jan-21', '4-Jan-21', '5-Jan-21', '6-Jan-21', '7-Jan-21',
'8-Jan-21', '9-Jan-21', '10-Jan-21', '11-Jan-21', '12-Jan-21'],
'Sales':[1140, 3046, 2981, 2262, 3266, 3231, 2979, 1687, 2728, 1136, 3159, 1799]
})
My task is to calculate a weighted average
Prior 3 days
& Post 3 days sales
considering all 3 promotions.我的任务是计算
weighted average
Prior 3 days
Post 3 days sales
,考虑所有 3 次促销活动。
Meaning, All 3 promotions has different dates.意思是,所有 3 个促销活动都有不同的日期。 I need to bring it to a common
Prior 3 days
& common Post 3 days
.我需要把它带到一个普通的
Prior 3 days
和 common Post 3 days
。
Step 1:步骤1:
Eg Promotion A
is on 5-Jan-21
, so Prior 3 days
would be from 2-Jan-21 to 4-Jan-21
.例如
Promotion A
是在5-Jan-21
,所以Prior 3 days
是从2-Jan-21 to 4-Jan-21
。
so the average would be 2763
(average of 3046, 2981, 2262) that of Post 3 Days
would be 6-Jan-21 to 8-Jan-21
.so the average would be 2632
(average of 3231, 2979, 1687)所以平均值为
2763
(平均值为 3046、2981、2262), Post 3 Days
的平均值6-Jan-21 to 8-Jan-21
所以平均值为2632
(平均值为 3231、2979、1687)
in the case of Promotion B
is on 8-Jan-21
, Prior 3 days
would be 5-Jan-21 to 7-Jan-21
averages is equal to 3159
(average of 3266, 3231, 2979).如果
Promotion B
是8-Jan-21
, Prior 3 days
将是5-Jan-21 to 7-Jan-21
平均值等于3159
(平均值为 3266、3231、2979)。
The Post 3 days
would be 9-Jan-21 to 11-Jan-21
average is 2341
(average of 2728, 1136, 3159). Post 3 days
将是9-Jan-21 to 11-Jan-21
平均值为2341
(平均值为 2728、1136、3159)。
For C
it is same as B
.对于
C
它与B
相同。 since, the dates are same.因为,日期是一样的。
Step 2:第2步:
Once after calculating the Prior 3
of A, B and C
individually.在分别计算
A, B and C
的Prior 3
之后。 I must averages this together.我必须把这个平均起来。 ie average would be
3027
(Average of 2763 of A
, 3159 of B
and 3159 of C
).即平均值为
3027
( A
的 2763、 B
的 3159 和C
的 3159 的平均值)。 same applied to Post 3 days
average.这同样适用于
Post 3 days
的平均值。 which is equal to 2438
(Average of 2632 of A
, 2341 of B
and 2341 of C
).等于
2438
( A
的 2632、 B
的 2341 和C
的 2341 的平均值)。
so my final answer should look like所以我的最终答案应该看起来像
+--------------+---------+
| Type | Average |
+--------------+---------+
| Prior 3 days | 3,027 |
+--------------+---------+
| Post 3 days | 2,438 |
+--------------+---------+
please guide me on how should I approach to solve this.请指导我应该如何解决这个问题。
Sample Data:样本数据:
df_sales = pd.DataFrame({
'Date':['1-Jan-21', '2-Jan-21', '3-Jan-21', '4-Jan-21', '5-Jan-21', '6-Jan-21', '7-Jan-21',
'8-Jan-21', '9-Jan-21', '10-Jan-21', '11-Jan-21', '12-Jan-21'],
'Sales':[1140, 3046, 2981, 2262, 3266, 3231, 2979, 1687, 2728, 1136, 3159, 1799]
})
df_prom = pd.DataFrame({
'Promotion':['A','B', 'C'],
'Date':['5-Jan-21','8-Jan-21', '8-Jan-21'] })
Steps:脚步:
df_proms = df_prom.groupby('Date').count().reset_index()
df = df_sales.merge(df_proms, on='Date', how='left')
df['rolling'] = df['Sales'].rolling(3).mean()
df['post 3 days'] = df['rolling'].shift(-3) * df['Promotion']
df['prior 3 days'] = df['rolling'].shift(1) * df['Promotion']
df = df[~df.Promotion.isnull()]
weighted_df = pd.DataFrame(data=df[['post 3 days', 'prior 3 days']].sum()/df['Promotion'].sum()).reset_index().rename({"index": "Type", 0: "Average"}, axis=1)
weighted_df
Type Average
0 post 3 days 2438.111111
1 prior 3 days 3026.777778
Here is solution working with overlapping values, because each datetime is processing separately.这是使用重叠值的解决方案,因为每个日期时间都是单独处理的。
For correct working is necessary all datetimes before and after 3 values exist in df_sales['Date']
and are sorted.为了正确工作,
df_sales['Date']
中存在 3 个值之前和之后的所有日期时间并进行排序。
First convert values to datetimes:首先将值转换为日期时间:
df_prom['Date'] = pd.to_datetime(df_prom['Date'], format='%d-%b-%y')
df_sales['Date'] = pd.to_datetime(df_sales['Date'], format='%d-%b-%y')
Then repeat Date
column to DataFrame with number of columns like values in df_prom
:然后将
Date
列重复到 DataFrame ,列数类似于df_prom
中的值:
arr = np.broadcast_to(df_sales['Date'].to_numpy()[:, None],
(df_sales.shape[0], df_prom.shape[0]))
df = pd.DataFrame(arr)
Compare datetimes and forward 3 and back filling 3 values for 3 previous and 3 next datetimes, this mask is used for filter Sales
:比较日期时间并向前 3 和向后填充 3 个前一个和下一个日期时间的 3 个值,此掩码用于过滤器
Sales
:
m = df.eq(df_prom['Date'])
prev_mask = df.where(m).bfill(limit=3).mask(m).notna()
next_mask = df.where(m).ffill(limit=3).mask(m).notna()
prev = np.where(prev_mask, df_sales['Sales'].to_numpy()[:, None], np.nan)
next1 = np.where(next_mask, df_sales['Sales'].to_numpy()[:, None], np.nan)
print (prev)
[[ nan nan nan]
[3046. nan nan]
[2981. nan nan]
[2262. nan nan]
[ nan 3266. 3266.]
[ nan 3231. 3231.]
[ nan 2979. 2979.]
[ nan nan nan]
[ nan nan nan]
[ nan nan nan]
[ nan nan nan]
[ nan nan nan]]
print (next1)
[[ nan nan nan]
[ nan nan nan]
[ nan nan nan]
[ nan nan nan]
[ nan nan nan]
[3231. nan nan]
[2979. nan nan]
[1687. nan nan]
[ nan 2728. 2728.]
[ nan 1136. 1136.]
[ nan 3159. 3159.]
[ nan nan nan]]
And last get mean with omit missing values:最后通过省略缺失值得到平均值:
fin = pd.DataFrame({'Type':['Prior 3 days','Post 3 days'],
'Average':[np.nanmean(prev), np.nanmean(next1)]
})
print (fin)
Type Average
0 Prior 3 days 3026.777778
1 Post 3 days 2438.111111
EDIT:编辑:
For dynamic limits use:对于动态限制,请使用:
limits = (pd.to_datetime('12-Jan-2021') - df_prom['Date']).dt.days
d = dict(enumerate(limits))
print (d)
prev_mask = df.where(m).apply(lambda x: x.bfill(limit=d[x.name])).mask(m).notna()
next_mask = df.where(m).apply(lambda x: x.ffill(limit=d[x.name])).mask(m).notna()
print (prev)
[[1140. nan nan]
[3046. nan nan]
[2981. nan nan]
[2262. 2262. 2262.]
[ nan 3266. 3266.]
[ nan 3231. 3231.]
[ nan 2979. 2979.]
[ nan nan nan]
[ nan nan nan]
[ nan nan nan]
[ nan nan nan]
[ nan nan nan]]
print (next1)
[[ nan nan nan]
[ nan nan nan]
[ nan nan nan]
[ nan nan nan]
[ nan nan nan]
[3231. nan nan]
[2979. nan nan]
[1687. nan nan]
[2728. 2728. 2728.]
[1136. 1136. 1136.]
[3159. 3159. 3159.]
[1799. 1799. 1799.]]
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.