How can I merge two DataFrames based on a filtered sum?

I have two DataFrames that look like this:

   start_date    end_date
1  2018-01-01  2018-01-31
2  2018-01-15  2018-02-28
3  2018-01-31  2018-03-15
4  2018-01-07  2018-04-30

            value
2018-01-01      1
2018-01-02      4
2018-01-03      2
2018-01-04     10
2018-01-05      0
...           ...
2018-12-28      1
2018-12-29      7
2018-12-30      9
2018-12-31      5

I'm trying to add a new column to the first DataFrame that contains the summed values of the second DataFrame, filtered by start_date and end_date. Something like:

   start_date    end_date  total_value
1  2018-01-01  2018-01-31           47  # Where 47 is the sum of values between 2018-01-01 and 2018-01-31, inclusive
2  2018-01-15  2018-02-28           82
3  2018-01-31  2018-03-15          116
4  2018-01-07  2018-04-30          253

I think I can do this with apply (basically just filter and sum the second DataFrame by start_date and end_date and return the sum), but I'm wondering if there's a neat pandas-esque solution instead.
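For reference, the apply approach described above can be sketched like this, on a tiny stand-in for the two frames (the data here is illustrative, not the full frames from the question):

```python
import pandas as pd

# Toy stand-ins for the two frames in the question.
df1 = pd.DataFrame({
    "start_date": pd.to_datetime(["2018-01-01", "2018-01-02"]),
    "end_date":   pd.to_datetime(["2018-01-03", "2018-01-04"]),
})
df2 = pd.DataFrame(
    {"value": [1, 4, 2, 10]},
    index=pd.date_range("2018-01-01", periods=4),
)

# Row-wise apply: slice df2 by the window (label slicing on a
# DatetimeIndex is inclusive on both ends) and sum the values.
df1["total_value"] = df1.apply(
    lambda r: df2.loc[r.start_date:r.end_date, "value"].sum(), axis=1
)
print(df1)
```

This works, but runs one slice-and-sum per row, which is what the vectorized answers below avoid.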

NEW ANSWER

I'm using the OP's data, which needs to be massaged slightly:

df2 = df2.asfreq('D').fillna(0, downcast='infer')
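To see what that massaging does: asfreq('D') expands a sparse index to a complete daily range, and the fill turns the new gaps into zeros. A minimal sketch with made-up dates (fillna(0).astype(int) standing in for the downcast argument, which newer pandas deprecates):

```python
import pandas as pd

# A sparse daily series: 2018-01-02 is missing entirely.
df2 = pd.DataFrame(
    {"value": [1, 2]},
    index=pd.to_datetime(["2018-01-01", "2018-01-03"]),
)

# asfreq('D') inserts the missing dates with NaN; filling with 0 and
# casting back to int keeps the later cumsum integer-valued.
df2 = df2.asfreq("D").fillna(0).astype(int)
print(df2)
```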

Then we do the cumulative-sum trick, with an added shift so the start date is included.

s = df2.value.cumsum()
starts = df1.start_date.map(s.shift().fillna(0, downcast='infer'))
ends = df1.end_date.map(s)

df1.assign(total_value=ends - starts)

  start_date   end_date  total_value
1 2018-01-01 2018-01-31           17
2 2018-01-15 2018-02-28            0
3 2018-01-31 2018-03-15            0
4 2018-01-07 2018-04-30            0
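Put together on toy data (two windows, four daily values; shift(fill_value=0) playing the role of shift().fillna(0) above), the whole recipe looks like:

```python
import pandas as pd

df1 = pd.DataFrame({
    "start_date": pd.to_datetime(["2018-01-01", "2018-01-02"]),
    "end_date":   pd.to_datetime(["2018-01-03", "2018-01-04"]),
})
df2 = pd.DataFrame(
    {"value": [1, 4, 2, 10]},
    index=pd.date_range("2018-01-01", periods=4),
)

# s[d] is the running total through date d; the shifted copy is the
# total *before* d, so ends - starts sums each window inclusively.
s = df2.value.cumsum()
starts = df1.start_date.map(s.shift(fill_value=0))
ends = df1.end_date.map(s)
result = df1.assign(total_value=ends - starts)
print(result)
```

Two map lookups replace the per-row slicing, so the cost is one pass to build the cumsum plus O(1) work per window.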

OLD ANSWER

Cool, but inaccurate: this gives the sum of values strictly after the start date. To include the start date itself, I have to use shift. See above.

You can use cumsum and take differences.

df1.assign(
    total_value=df1.applymap(df2.cumsum().value.get).eval('end_date - start_date'))

  start_date   end_date  total_value
1 2018-01-01 2018-01-31          145
2 2018-01-15 2018-02-28          229
3 2018-01-31 2018-03-15          212
4 2018-01-07 2018-04-30          535
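The off-by-one described above is easy to see on a small series: a plain cumsum difference s[end] - s[start] excludes the value on the start date itself, while shifting the cumsum first makes the window inclusive. A sketch with made-up numbers:

```python
import pandas as pd

# Four consecutive days of values: 1, 4, 2, 10.
s = pd.Series(
    [1, 4, 2, 10], index=pd.date_range("2018-01-01", periods=4)
).cumsum()                       # running totals: 1, 5, 7, 17

start, end = pd.Timestamp("2018-01-01"), pd.Timestamp("2018-01-03")

# Plain difference drops the value on the start date (1 here):
exclusive = s[end] - s[start]                      # 7 - 1 = 6
# Shifting the cumsum before the start lookup makes it inclusive:
inclusive = s[end] - s.shift(fill_value=0)[start]  # 7 - 0 = 7
print(exclusive, inclusive)
```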

Setup

import numpy as np
import pandas as pd

np.random.seed([3, 1415])

min_date = df1.values.min()
max_date = df1.values.max()
tidx = pd.date_range(min_date, max_date)
df2 = pd.DataFrame(dict(value=np.random.randint(10, size=len(tidx))), tidx)

ANOTHER ANSWER

Setup

df2.reset_index(inplace=True)

Create your conditions using a loop and zip (it's important that output matches the index of your df1):

conditions = [df2['index'].between(i, j) for i, j in zip(df1.start_date, df1.end_date)]
output = df1.index

Use np.select, then groupby:

tmp = df2.assign(flag=np.select(conditions, output, np.nan))
tmp = tmp.dropna().groupby('flag').value.sum()

Finally merge:

df1.merge(tmp.to_frame(), left_index=True, right_index=True)

Output:

    start_date   end_date  value
1.0 2018-01-01 2018-01-31     17
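End to end on toy data (two overlapping windows, four daily values; the flag index is cast back to int so the final index merge lines up):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "start_date": pd.to_datetime(["2018-01-01", "2018-01-02"]),
    "end_date":   pd.to_datetime(["2018-01-03", "2018-01-04"]),
})
df2 = pd.DataFrame(
    {"value": [1, 4, 2, 10]},
    index=pd.date_range("2018-01-01", periods=4),
).reset_index()

# One boolean mask per df1 row; np.select then tags every df2 date
# with the index of the *first* window that contains it.
conditions = [df2["index"].between(i, j)
              for i, j in zip(df1.start_date, df1.end_date)]
tmp = df2.assign(flag=np.select(conditions, df1.index, np.nan))
sums = tmp.dropna().groupby("flag").value.sum()
sums.index = sums.index.astype(int)   # flag comes back as float

out = df1.merge(sums.to_frame(), left_index=True, right_index=True)
print(out)
```

Note the caveat this toy data exposes: because np.select keeps only the first matching window, the overlapping second window is undercounted (it gets 10, the value on 2018-01-04, rather than the inclusive 16), so this approach only suits non-overlapping ranges.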

ANOTHER ANSWER

Note that this is an O(m*n) method: create a new key to force a full cross-merge, then filter.

df1['Newkey'] = 1
df2['Newkey'] = 1
df2.reset_index(inplace=True)
mergefilterdf = df1.merge(df2).loc[
    lambda x: (x['start_date'] <= x['index']) & (x['end_date'] >= x['index'])]
mergefilterdf.groupby(['start_date', 'end_date']).value.sum()

Output:

start_date  end_date
2018-01-01  2018-01-31    17
Name: value, dtype: int64
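A runnable sketch of this cross-join on toy data (on pandas >= 1.2 the helper key can be replaced by merge(..., how='cross')):

```python
import pandas as pd

df1 = pd.DataFrame({
    "start_date": pd.to_datetime(["2018-01-01", "2018-01-02"]),
    "end_date":   pd.to_datetime(["2018-01-03", "2018-01-04"]),
})
df2 = pd.DataFrame(
    {"value": [1, 4, 2, 10]},
    index=pd.date_range("2018-01-01", periods=4),
).reset_index()

# A constant key pairs every df1 row with every df2 row (m*n rows);
# then keep only the pairs whose date falls inside the window.
df1["Newkey"] = 1
df2["Newkey"] = 1
pairs = df1.merge(df2)
inside = pairs.loc[lambda x: (x["start_date"] <= x["index"])
                   & (x["end_date"] >= x["index"])]
totals = inside.groupby(["start_date", "end_date"]).value.sum()
print(totals)
```

Unlike the np.select answer, overlapping windows are handled correctly here, since each window keeps its own copy of the dates; the price is materializing all m*n candidate pairs.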
