df_A
start_date end_date
0 2017-03-01 2017-04-20
1 2017-03-20 2017-04-27
2 2017-04-10 2017-05-25
3 2017-04-17 2017-05-22
df_B
event_date price
0 2017-03-15 100
1 2017-02-22 200
2 2017-04-30 100
3 2017-05-20 150
4 2017-05-23 150
Result
   start_date   end_date  avg.price
0  2017-03-01  2017-04-20      100.0
1  2017-03-20  2017-04-27        NaN
2  2017-04-10  2017-05-25      133.3
3  2017-04-17  2017-05-22      125.0
One way, if your dataframes aren't big, is to take the Cartesian product of the two and then filter it.
# Cross join df_A and df_B via a dummy key, keep only the events whose
# event_date falls inside each interval, then average price per start_date.
mapper = df_A.assign(key=1).merge(df_B.assign(key=1))\
             .query('start_date <= event_date <= end_date')\
             .groupby('start_date')['price'].mean()
df_A['avg.price'] = df_A['start_date'].map(mapper)
print(df_A)
Output:
start_date end_date avg.price
0 2017-03-01 2017-04-20 100.000000
1 2017-03-20 2017-04-27 NaN
2 2017-04-10 2017-05-25 133.333333
3 2017-04-17 2017-05-22 125.000000
Otherwise, see this SO post.
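For larger frames, one common alternative is a row-wise aggregation that avoids materializing the full cross product; a minimal sketch, assuming the same df_A / df_B as above with datetime columns:
# For each interval in df_A, average the prices of the df_B events that
# fall inside it; mean() of an empty selection yields NaN.
df_A['avg.price'] = df_A.apply(
    lambda r: df_B.loc[
        df_B['event_date'].between(r['start_date'], r['end_date']), 'price'
    ].mean(),
    axis=1,
)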
conditional_join from pyjanitor may be helpful for its abstraction/convenience; the function is currently in the development version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
(df_B.conditional_join(
        df_A,
        ('event_date', 'start_date', '>='),
        ('event_date', 'end_date', '<='),
        how='right')
     .droplevel(level=0, axis=1)
     .loc[:, ['price', 'start_date', 'end_date']]
     .groupby(['start_date', 'end_date'])
     .agg(avg_price=('price', 'mean'))
)
avg_price
start_date end_date
2017-03-01 2017-04-20 100.000000
2017-03-20 2017-04-27 NaN
2017-04-10 2017-05-25 133.333333
2017-04-17 2017-05-22 125.000000
Under the hood it uses a binary search (np.searchsorted) to avoid the Cartesian product. If your intervals did not overlap, a pd.IntervalIndex would be a more efficient option.
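A minimal sketch of that idea, assuming non-overlapping intervals, datetime64 columns, and a default RangeIndex on df_A (names taken from the example above):
import pandas as pd

# Build an IntervalIndex from df_A's ranges; get_indexer only works when
# the intervals do not overlap (it raises otherwise).
intervals = pd.IntervalIndex.from_arrays(
    df_A['start_date'], df_A['end_date'], closed='both'
)
# Position of the containing interval for each event (-1 when none matches).
pos = intervals.get_indexer(df_B['event_date'])
matched = pos >= 0
# Average price per matched interval position, then map back onto df_A.
avg = df_B.loc[matched, 'price'].groupby(pos[matched]).mean()
df_A['avg.price'] = df_A.index.map(avg)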