I have two DataFrames that look like this:

  start_date   end_date
1 2018-01-01 2018-01-31
2 2018-01-15 2018-02-28
3 2018-01-31 2018-03-15
4 2018-01-07 2018-04-30
            value
2018-01-01      1
2018-01-02      4
2018-01-03      2
2018-01-04     10
2018-01-05      0
...           ...
2018-12-28      1
2018-12-29      7
2018-12-30      9
2018-12-31      5
I'm trying to add a new column to the first DataFrame that contains the summed values of the second DataFrame, filtered by start_date and end_date. Something like:

  start_date   end_date  total_value
1 2018-01-01 2018-01-31           47  # sum of values between 2018-01-01 and 2018-01-31, inclusive
2 2018-01-15 2018-02-28           82
3 2018-01-31 2018-03-15          116
4 2018-01-07 2018-04-30          253
I think I can do this with apply (basically, filter the second DataFrame by start_date and end_date and return the sum), but I'm wondering if there's a neat pandas-esque solution instead.
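For reference, the apply version I had in mind looks roughly like this — a sketch that rebuilds both frames using only the five sample values visible above:

```python
import pandas as pd

# Rebuild the two frames from the question, keeping only the five
# sample values that are actually shown.
df1 = pd.DataFrame({
    'start_date': pd.to_datetime(['2018-01-01', '2018-01-15', '2018-01-31', '2018-01-07']),
    'end_date':   pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-15', '2018-04-30']),
}, index=[1, 2, 3, 4])
df2 = pd.DataFrame({'value': [1, 4, 2, 10, 0]},
                   index=pd.date_range('2018-01-01', periods=5))

# apply row-wise: slice df2 by the window (.loc slicing on a sorted
# DatetimeIndex is inclusive on both ends) and sum what's left.
df1['total_value'] = df1.apply(
    lambda row: df2.loc[row['start_date']:row['end_date'], 'value'].sum(),
    axis=1)
```

With only those five values, every window after early January sums to 0.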
I'm using the OP's data, which needs to be massaged slightly so that every day appears in the index (missing days are filled with 0):

df2 = df2.asfreq('D').fillna(0, downcast='infer')

Then we do the cumsum thing with an added shift:

s = df2.value.cumsum()
starts = df1.start_date.map(s.shift().fillna(0, downcast='infer'))
ends = df1.end_date.map(s)

df1.assign(total_value=ends - starts)

  start_date   end_date  total_value
1 2018-01-01 2018-01-31           17
2 2018-01-15 2018-02-28            0
3 2018-01-31 2018-03-15            0
4 2018-01-07 2018-04-30            0

Cool, but why the shift? A plain difference of cumulative sums gives the sum of values strictly after the start date; to include the start date, I have to shift the cumulative sum before mapping start_date, as above.
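Here is a self-contained sketch of the same shift trick, again assuming only the question's five sample values; instead of asfreq it reindexes over the full span of df1's dates, so every map lookup finds a key:

```python
import pandas as pd

df1 = pd.DataFrame({
    'start_date': pd.to_datetime(['2018-01-01', '2018-01-15', '2018-01-31', '2018-01-07']),
    'end_date':   pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-15', '2018-04-30']),
}, index=[1, 2, 3, 4])
df2 = pd.DataFrame({'value': [1, 4, 2, 10, 0]},
                   index=pd.date_range('2018-01-01', periods=5))

# Daily index covering every date df1 can ask about; missing days become 0.
full = pd.date_range(df1.start_date.min(), df1.end_date.max())
s = df2['value'].reindex(full, fill_value=0).cumsum()

# The shift makes the start date inclusive: starts holds the cumulative
# total *before* each start_date, so the difference counts start_date itself.
starts = df1.start_date.map(s.shift(fill_value=0))
ends = df1.end_date.map(s)
result = df1.assign(total_value=ends - starts)
```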
You can use cumsum and take differences:

df1.assign(
    total_value=df1.applymap(df2.cumsum().value.get).eval('end_date - start_date'))

  start_date   end_date  total_value
1 2018-01-01 2018-01-31          145
2 2018-01-15 2018-02-28          229
3 2018-01-31 2018-03-15          212
4 2018-01-07 2018-04-30          535
Setup for the randomized data used above:

np.random.seed([3, 1415])
min_date = df1.values.min()
max_date = df1.values.max()
tidx = pd.date_range(min_date, max_date)
df2 = pd.DataFrame(dict(value=np.random.randint(10, size=len(tidx))), tidx)
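To make the "sum of numbers after the start date" point above concrete, here is the unshifted difference of cumulative sums on the question's five sample values (reindexed to daily frequency so every lookup succeeds):

```python
import pandas as pd

df1 = pd.DataFrame({
    'start_date': pd.to_datetime(['2018-01-01', '2018-01-15', '2018-01-31', '2018-01-07']),
    'end_date':   pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-15', '2018-04-30']),
}, index=[1, 2, 3, 4])
df2 = pd.DataFrame({'value': [1, 4, 2, 10, 0]},
                   index=pd.date_range('2018-01-01', periods=5))

full = pd.date_range(df1.start_date.min(), df1.end_date.max())
cs = df2['value'].reindex(full, fill_value=0).cumsum()

# No shift: the value recorded on start_date itself is subtracted away.
no_shift = df1.end_date.map(cs) - df1.start_date.map(cs)
```

Row 1 comes out as 16 rather than 17 because the value 1 on 2018-01-01 is excluded; shifting the cumulative sum before mapping start_date fixes exactly this.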
Setup
df2.reset_index(inplace=True)

Create your conditions using a loop and zip (it's important that output matches the index of your df1):
conditions = [df2['index'].between(i, j) for i, j in zip(df1.start_date, df1.end_date)]
output = df1.index
Use np.select, then groupby:
tmp = df2.assign(flag=np.select(conditions, output, np.nan))
tmp = tmp.dropna().groupby('flag').value.sum()
Finally, merge:

df1.merge(tmp.to_frame(), left_index=True, right_index=True)

Output:

    start_date   end_date  value
1.0 2018-01-01 2018-01-31     17
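Putting that pipeline together as a runnable sketch on the question's five sample values (only the first window overlaps any data, so only index 1 survives):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'start_date': pd.to_datetime(['2018-01-01', '2018-01-15', '2018-01-31', '2018-01-07']),
    'end_date':   pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-15', '2018-04-30']),
}, index=[1, 2, 3, 4])
df2 = pd.DataFrame({'value': [1, 4, 2, 10, 0]},
                   index=pd.date_range('2018-01-01', periods=5)).reset_index()

# One boolean mask per df1 window; np.select tags each df2 row with the
# index of the first window containing its date (NaN if none match).
conditions = [df2['index'].between(i, j)
              for i, j in zip(df1.start_date, df1.end_date)]
tmp = df2.assign(flag=np.select(conditions, df1.index, np.nan))
sums = tmp.dropna().groupby('flag')['value'].sum()
```

Because np.select takes the first matching condition, each day is credited to at most one window, so overlapping ranges are not double-counted — which may or may not be what you want. The inner merge back onto df1 then drops the windows that matched nothing.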
Note this is an O(m*n) method: create a new key to merge on (a full cross join), then filter by the date window:

df1['Newkey'] = 1
df2['Newkey'] = 1
df2.reset_index(inplace=True)
mergefilterdf = df1.merge(df2).loc[lambda x: (x['start_date'] <= x['index']) & (x['end_date'] >= x['index'])]
mergefilterdf.groupby(['start_date', 'end_date']).value.sum()

Out[331]:
start_date  end_date
2018-01-01  2018-01-31    17
Name: value, dtype: int64
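A self-contained sketch of the cross-join approach, again using only the question's five sample values:

```python
import pandas as pd

df1 = pd.DataFrame({
    'start_date': pd.to_datetime(['2018-01-01', '2018-01-15', '2018-01-31', '2018-01-07']),
    'end_date':   pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-15', '2018-04-30']),
}, index=[1, 2, 3, 4])
df2 = pd.DataFrame({'value': [1, 4, 2, 10, 0]},
                   index=pd.date_range('2018-01-01', periods=5)).reset_index()

# Constant key: every df1 row pairs with every df2 row (m*n rows in total),
# then the date filter keeps only the pairs inside each window.
merged = df1.assign(Newkey=1).merge(df2.assign(Newkey=1), on='Newkey')
result = (merged
          .loc[lambda x: (x['start_date'] <= x['index'])
                       & (x['end_date'] >= x['index'])]
          .groupby(['start_date', 'end_date'])['value'].sum())
```

On pandas 1.2+, df1.merge(df2, how='cross') builds the same pairing without the dummy key. Either way, the intermediate frame has m*n rows, so this is only practical for modest sizes.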