
Weighted Average calculation based on conditions in pandas/python

I have two DataFrames, one with promotion data and one with sales data, as shown below.

+-----------+----------+
| Promotion |   Date   |
+-----------+----------+
|     A     | 5-Jan-21 |
+-----------+----------+
|     B     | 8-Jan-21 |
+-----------+----------+
|     C     | 8-Jan-21 |
+-----------+----------+

import pandas as pd

df_prom = pd.DataFrame({
    'Promotion': ['A', 'B', 'C'],
    'Date': ['5-Jan-21', '8-Jan-21', '8-Jan-21']
})

+-----------+-------+
|    Date   | Sales |
+-----------+-------+
|  1-Jan-21 | 1,140 |
+-----------+-------+
|  2-Jan-21 | 3,046 |
+-----------+-------+
|  3-Jan-21 | 2,981 |
+-----------+-------+
|  4-Jan-21 | 2,262 |
+-----------+-------+
|  5-Jan-21 | 3,266 |
+-----------+-------+
|  6-Jan-21 | 3,231 |
+-----------+-------+
|  7-Jan-21 | 2,979 |
+-----------+-------+
|  8-Jan-21 | 1,687 |
+-----------+-------+
|  9-Jan-21 | 2,728 |
+-----------+-------+
| 10-Jan-21 | 1,136 |
+-----------+-------+
| 11-Jan-21 | 3,159 |
+-----------+-------+
| 12-Jan-21 | 1,799 |
+-----------+-------+

df_sales = pd.DataFrame({
    'Date': ['1-Jan-21', '2-Jan-21', '3-Jan-21', '4-Jan-21', '5-Jan-21', '6-Jan-21', '7-Jan-21',
             '8-Jan-21', '9-Jan-21', '10-Jan-21', '11-Jan-21', '12-Jan-21'],
    'Sales': [1140, 3046, 2981, 2262, 3266, 3231, 2979, 1687, 2728, 1136, 3159, 1799]
})

My task is to calculate a weighted average of sales for the Prior 3 days and the Post 3 days, taking all 3 promotions into account.

That is, the promotions do not all fall on the same date, so I need to bring them to a common Prior 3 days and a common Post 3 days.

Step 1:

E.g. Promotion A is on 5-Jan-21, so the Prior 3 days run from 2-Jan-21 to 4-Jan-21 and the average is 2763 (average of 3046, 2981, 2262).

The Post 3 days run from 6-Jan-21 to 8-Jan-21, so the average is 2632 (average of 3231, 2979, 1687).

Promotion B is on 8-Jan-21, so the Prior 3 days run from 5-Jan-21 to 7-Jan-21 and the average is 3159 (average of 3266, 3231, 2979).

The Post 3 days run from 9-Jan-21 to 11-Jan-21 and the average is 2341 (average of 2728, 1136, 3159).

For C it is the same as B, since the dates are the same.

Step 2:

After calculating the Prior 3 days average for A, B and C individually, I must average these together, i.e. the result is 3027 (average of 2763 for A, 3159 for B and 3159 for C). The same applies to the Post 3 days average, which is 2438 (average of 2632 for A, 2341 for B and 2341 for C).
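The two steps described above can be sketched as a plain loop. This is only an illustration, assuming that df_sales has exactly one row per day, is sorted by date, and has at least three rows on each side of every promotion date; the names pos, per_promo and final are made up for this sketch and are not part of the question.

```python
import pandas as pd

df_sales = pd.DataFrame({
    'Date': ['1-Jan-21', '2-Jan-21', '3-Jan-21', '4-Jan-21', '5-Jan-21', '6-Jan-21',
             '7-Jan-21', '8-Jan-21', '9-Jan-21', '10-Jan-21', '11-Jan-21', '12-Jan-21'],
    'Sales': [1140, 3046, 2981, 2262, 3266, 3231, 2979, 1687, 2728, 1136, 3159, 1799]
})
df_prom = pd.DataFrame({
    'Promotion': ['A', 'B', 'C'],
    'Date': ['5-Jan-21', '8-Jan-21', '8-Jan-21']
})

# Row position of every date in df_sales (assumption: one row per day, sorted).
pos = {date: i for i, date in enumerate(df_sales['Date'])}
sales = df_sales['Sales'].to_numpy()

rows = []
for promo, date in zip(df_prom['Promotion'], df_prom['Date']):
    i = pos[date]
    rows.append({
        'Promotion': promo,
        'Prior 3 days': sales[i - 3:i].mean(),     # Step 1: three rows before the promotion
        'Post 3 days': sales[i + 1:i + 4].mean(),  # Step 1: three rows after the promotion
    })

per_promo = pd.DataFrame(rows)

# Step 2: average the per-promotion means.
final = per_promo[['Prior 3 days', 'Post 3 days']].mean()
print(per_promo.round(2))
print(final.round(2))
```

The slices are positional, so a promotion closer than three rows to either end of df_sales would silently get a shorter (or empty) window.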

So my final answer should look like:

+--------------+---------+
| Type         | Average |
+--------------+---------+
| Prior 3 days | 3,027   |
+--------------+---------+
| Post 3 days  | 2,438   |
+--------------+---------+

Please guide me on how I should approach this.

Sample Data:

df_sales = pd.DataFrame({
    'Date': ['1-Jan-21', '2-Jan-21', '3-Jan-21', '4-Jan-21', '5-Jan-21', '6-Jan-21', '7-Jan-21',
             '8-Jan-21', '9-Jan-21', '10-Jan-21', '11-Jan-21', '12-Jan-21'],
    'Sales': [1140, 3046, 2981, 2262, 3266, 3231, 2979, 1687, 2728, 1136, 3159, 1799]
})

df_prom = pd.DataFrame({
    'Promotion':['A','B', 'C'],
    'Date':['5-Jan-21','8-Jan-21', '8-Jan-21']  })

Steps:

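# count the promotions on each date; the count is used as the weight
df_proms = df_prom.groupby('Date').count().reset_index()

# attach the counts to the daily sales; days without a promotion get NaN
df = df_sales.merge(df_proms, on='Date', how='left')

# mean of the current row and the two rows before it
df['rolling'] = df['Sales'].rolling(3).mean()

# shift so that each row sees the mean of the next / previous three rows, times its weight
df['post 3 days'] = df['rolling'].shift(-3) * df['Promotion']
df['prior 3 days'] = df['rolling'].shift(1) * df['Promotion']

# keep only the promotion days
df = df[~df.Promotion.isnull()]

# weighted average: sum of the weighted means divided by the total number of promotions
weighted_df = pd.DataFrame(data=df[['post 3 days', 'prior 3 days']].sum()/df['Promotion'].sum()).reset_index().rename({"index": "Type", 0: "Average"}, axis=1)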

weighted_df
    Type    Average
0   post 3 days     2438.111111
1   prior 3 days    3026.777778
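To see why the shifts line up with the Prior and Post windows, here is a small check on the same sales figures. It is a sketch that assumes the rows are consecutive days in order; the variable names a, b, weighted_prior and weighted_post are illustrative only.

```python
import pandas as pd

sales = pd.Series([1140, 3046, 2981, 2262, 3266, 3231, 2979, 1687, 2728, 1136, 3159, 1799])

rolling = sales.rolling(3).mean()  # row i holds the mean of rows i-2 .. i
prior = rolling.shift(1)           # row i holds the mean of rows i-3 .. i-1
post = rolling.shift(-3)           # row i holds the mean of rows i+1 .. i+3

a, b = 4, 7  # row positions of 5-Jan-21 (A) and 8-Jan-21 (B and C)

# Weighting each date by its number of promotions (1 for A's date, 2 for B/C's date)
# gives the same result as averaging the three per-promotion means.
weighted_prior = (1 * prior.iloc[a] + 2 * prior.iloc[b]) / 3
weighted_post = (1 * post.iloc[a] + 2 * post.iloc[b]) / 3
print(weighted_prior, weighted_post)
```

Because the windows are defined by row shifts, a missing day in the sales data would shift the window without raising an error.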

Here is a solution that also works when the windows overlap, because each promotion date is processed separately.

For it to work correctly, the 3 rows before and the 3 rows after every promotion date must exist in df_sales['Date'], and the dates must be sorted.

First convert the values to datetimes:

df_prom['Date'] = pd.to_datetime(df_prom['Date'], format='%d-%b-%y')
df_sales['Date'] = pd.to_datetime(df_sales['Date'], format='%d-%b-%y')
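Once both columns are datetimes, the precondition mentioned earlier (sorted dates, three rows available on each side of every promotion) can be checked before going further. The following is a sketch, not part of the original answer; the names n, is_sorted, pos, has_window and no_gaps are hypothetical, and the no_gaps check only matters if "3 days" is meant as calendar days rather than rows.

```python
import pandas as pd

df_sales = pd.DataFrame({
    'Date': ['1-Jan-21', '2-Jan-21', '3-Jan-21', '4-Jan-21', '5-Jan-21', '6-Jan-21',
             '7-Jan-21', '8-Jan-21', '9-Jan-21', '10-Jan-21', '11-Jan-21', '12-Jan-21'],
    'Sales': [1140, 3046, 2981, 2262, 3266, 3231, 2979, 1687, 2728, 1136, 3159, 1799]
})
df_prom = pd.DataFrame({
    'Promotion': ['A', 'B', 'C'],
    'Date': ['5-Jan-21', '8-Jan-21', '8-Jan-21']
})
df_prom['Date'] = pd.to_datetime(df_prom['Date'], format='%d-%b-%y')
df_sales['Date'] = pd.to_datetime(df_sales['Date'], format='%d-%b-%y')

n = 3  # window length in rows

# Dates must be in ascending order for the fill-based masks to mean "before" and "after".
is_sorted = df_sales['Date'].is_monotonic_increasing

# Row position of each promotion date in df_sales; -1 means the date is missing.
pos = pd.Index(df_sales['Date']).get_indexer(df_prom['Date'])

# Each promotion needs n rows before it and n rows after it.
has_window = (pos >= n) & (pos <= len(df_sales) - 1 - n)

# Optional: every row is exactly one day after the previous one.
no_gaps = df_sales['Date'].diff().dropna().eq(pd.Timedelta(days=1)).all()

print(is_sorted, has_window.all(), no_gaps)
```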

Then repeat the Date column into a DataFrame that has one column for each row of df_prom:

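import numpy as np

# one column of dates per promotion, shape (number of sales rows, number of promotions)
arr = np.broadcast_to(df_sales['Date'].to_numpy()[:, None],
                      (df_sales.shape[0], df_prom.shape[0]))

df = pd.DataFrame(arr)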

Compare the datetimes with the promotion dates, then back fill 3 rows to mark the 3 previous datetimes and forward fill 3 rows to mark the 3 next datetimes. These masks are used to filter Sales:

m = df.eq(df_prom['Date'])
prev_mask = df.where(m).bfill(limit=3).mask(m).notna()
next_mask = df.where(m).ffill(limit=3).mask(m).notna()

prev = np.where(prev_mask, df_sales['Sales'].to_numpy()[:, None], np.nan)
next1 = np.where(next_mask, df_sales['Sales'].to_numpy()[:, None], np.nan)

print (prev)
[[  nan   nan   nan]
 [3046.   nan   nan]
 [2981.   nan   nan]
 [2262.   nan   nan]
 [  nan 3266. 3266.]
 [  nan 3231. 3231.]
 [  nan 2979. 2979.]
 [  nan   nan   nan]
 [  nan   nan   nan]
 [  nan   nan   nan]
 [  nan   nan   nan]
 [  nan   nan   nan]]
print (next1)
[[  nan   nan   nan]
 [  nan   nan   nan]
 [  nan   nan   nan]
 [  nan   nan   nan]
 [  nan   nan   nan]
 [3231.   nan   nan]
 [2979.   nan   nan]
 [1687.   nan   nan]
 [  nan 2728. 2728.]
 [  nan 1136. 1136.]
 [  nan 3159. 3159.]
 [  nan   nan   nan]]
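The masking trick used above is easier to follow on a single column. The sketch below applies the same where / bfill / ffill / mask chain to one promotion date; it assumes twelve consecutive daily dates, and the names dates, promo_date and marked are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.date_range('2021-01-01', periods=12, freq='D'))
promo_date = pd.Timestamp('2021-01-05')

m = dates.eq(promo_date)   # True only on the promotion day
marked = dates.where(m)    # NaT everywhere except the promotion day

# Back fill copies the promotion date onto the 3 rows before it,
# forward fill onto the 3 rows after it; mask(m) then drops the promotion day itself.
prev_mask = marked.bfill(limit=3).mask(m).notna()
next_mask = marked.ffill(limit=3).mask(m).notna()

print(np.flatnonzero(prev_mask.to_numpy()))  # [1 2 3]
print(np.flatnonzero(next_mask.to_numpy()))  # [5 6 7]
```

Rows 1 to 3 are 2-Jan to 4-Jan and rows 5 to 7 are 6-Jan to 8-Jan, which matches the first column of the two arrays printed above.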

Finally, take the mean while ignoring the missing values:

fin = pd.DataFrame({'Type':['Prior 3 days','Post 3 days'],
                    'Average':[np.nanmean(prev), np.nanmean(next1)]
                    })
print (fin)
           Type      Average
0  Prior 3 days  3026.777778
1   Post 3 days  2438.111111
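If the per-promotion averages from Step 1 of the question are also of interest, they can be read off the same arrays by taking the mean column by column. This is a sketch that rebuilds the arrays from scratch so it runs on its own; prior_by_promo and post_by_promo are names invented here, and the .copy() is only added so the broadcast array is writable.

```python
import numpy as np
import pandas as pd

df_sales = pd.DataFrame({
    'Date': ['1-Jan-21', '2-Jan-21', '3-Jan-21', '4-Jan-21', '5-Jan-21', '6-Jan-21',
             '7-Jan-21', '8-Jan-21', '9-Jan-21', '10-Jan-21', '11-Jan-21', '12-Jan-21'],
    'Sales': [1140, 3046, 2981, 2262, 3266, 3231, 2979, 1687, 2728, 1136, 3159, 1799]
})
df_prom = pd.DataFrame({
    'Promotion': ['A', 'B', 'C'],
    'Date': ['5-Jan-21', '8-Jan-21', '8-Jan-21']
})
df_prom['Date'] = pd.to_datetime(df_prom['Date'], format='%d-%b-%y')
df_sales['Date'] = pd.to_datetime(df_sales['Date'], format='%d-%b-%y')

# Same construction as in the answer: one date column per promotion.
arr = np.broadcast_to(df_sales['Date'].to_numpy()[:, None],
                      (df_sales.shape[0], df_prom.shape[0])).copy()
df = pd.DataFrame(arr)
m = df.eq(df_prom['Date'])
prev_mask = df.where(m).bfill(limit=3).mask(m).notna()
next_mask = df.where(m).ffill(limit=3).mask(m).notna()
prev = np.where(prev_mask, df_sales['Sales'].to_numpy()[:, None], np.nan)
next1 = np.where(next_mask, df_sales['Sales'].to_numpy()[:, None], np.nan)

# axis=0 averages down each column, i.e. one mean per promotion.
prior_by_promo = pd.Series(np.nanmean(prev, axis=0), index=df_prom['Promotion'])
post_by_promo = pd.Series(np.nanmean(next1, axis=0), index=df_prom['Promotion'])
print(prior_by_promo.round(2))
print(post_by_promo.round(2))

# With equal window sizes, the mean of the column means equals the pooled mean.
print(prior_by_promo.mean(), np.nanmean(prev))
```

The pooled np.nanmean and the mean of the per-promotion means agree here because every column contains exactly three values.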

EDIT:

For dynamic limits use:

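# number of days from each promotion date to the last sales date, used as the fill limit
limits = (pd.to_datetime('12-Jan-2021') - df_prom['Date']).dt.days

d = dict(enumerate(limits))
print (d)

prev_mask = df.where(m).apply(lambda x: x.bfill(limit=d[x.name])).mask(m).notna()
next_mask = df.where(m).apply(lambda x: x.ffill(limit=d[x.name])).mask(m).notna()

# rebuild the filtered Sales arrays from the new masks
prev = np.where(prev_mask, df_sales['Sales'].to_numpy()[:, None], np.nan)
next1 = np.where(next_mask, df_sales['Sales'].to_numpy()[:, None], np.nan)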

print (prev)
[[1140.   nan   nan]
 [3046.   nan   nan]
 [2981.   nan   nan]
 [2262. 2262. 2262.]
 [  nan 3266. 3266.]
 [  nan 3231. 3231.]
 [  nan 2979. 2979.]
 [  nan   nan   nan]
 [  nan   nan   nan]
 [  nan   nan   nan]
 [  nan   nan   nan]
 [  nan   nan   nan]]

print (next1)
[[  nan   nan   nan]
 [  nan   nan   nan]
 [  nan   nan   nan]
 [  nan   nan   nan]
 [  nan   nan   nan]
 [3231.   nan   nan]
 [2979.   nan   nan]
 [1687.   nan   nan]
 [2728. 2728. 2728.]
 [1136. 1136. 1136.]
 [3159. 3159. 3159.]
 [1799. 1799. 1799.]]
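With dynamic limits the windows can differ in length, as in the next1 array above, where the first column holds 7 values and the other two hold 4. In that case a pooled np.nanmean weights every day equally, while the question's Step 2 averages one mean per promotion, and the two no longer coincide. The sketch below rebuilds next1 by hand from the printed output to show the difference; which of the two is wanted is not stated in the source, and the names pooled and mean_of_means are illustrative.

```python
import numpy as np

sales = np.array([1140, 3046, 2981, 2262, 3266, 3231,
                  2979, 1687, 2728, 1136, 3159, 1799], dtype=float)

# Rebuild the printed next1 array: rows 5..11 for A, rows 8..11 for B and C.
next1 = np.full((12, 3), np.nan)
next1[5:, 0] = sales[5:]
next1[8:, 1] = sales[8:]
next1[8:, 2] = sales[8:]

pooled = np.nanmean(next1)                       # every non-missing cell counts once
mean_of_means = np.nanmean(next1, axis=0).mean() # every promotion counts once

print(round(pooled, 2), round(mean_of_means, 2))
```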
