
Multiple Conditions Sum Between Two Pandas Dataframes

I am new to Pandas and am having a hard time finding this sum. Suppose I have the following dataframes:

df1 = pd.DataFrame({'Product1': ['A', 'A', 'B'], 
                    'Date1': ['2019-02-01', '2019-12-15', '2019-03-01']})
df1['Date1'] = pd.to_datetime(df1['Date1'])
df1

Product1  Date1
A         2019-02-01
A         2019-12-15
B         2019-03-01

and

df2 = pd.DataFrame({'Product2': ['A', 'A', 'A', 'C', 'B', 'B'], 
                    'Date2': ['2019-01-01', '2019-02-01', '2019-06-01', '2019-06-30', '2019-01-31', '2019-02-15'],
                    'Payment': [100, 150, 100, 500, 25, 25]})
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2

Product2  Date2        Payment
A         2019-01-01   100
A         2019-02-01   150
A         2019-06-01   100
C         2019-06-30   500
B         2019-01-31   25
B         2019-02-15   25

I would like to obtain the following result

Product1  Date1        Total
A         2019-02-01   250
A         2019-12-15   350
B         2019-03-01   50

where df1['Total'] is the sum of df2['Payment'] over the rows where df2['Product2'] == df1['Product1'] and df2['Date2'] <= df1['Date1'].

The best I have managed so far is the following conditional loop:

sum_list = []
for col1, col2 in zip(df1['Product1'].values, df1['Date1'].values):
    cond = (df2['Product2'] == col1) & (df2['Date2'] <= col2)
    sum_list.append(df2[cond]['Payment'].sum())

df1['Total'] = sum_list
df1

but it is extremely slow when dealing with million rows. I believe there must be a better way.

Using merge, query and GroupBy.sum:

mrg = df1.merge(df2, left_on='Product1', right_on='Product2')

mrg.query('Date2 <= Date1').groupby(['Product1', 'Date1'], as_index=False)['Payment'].sum()

Output

  Product1      Date1  Payment
0        A 2019-02-01      250
1        A 2019-12-15      350
2        B 2019-03-01       50
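On millions of rows, the one-to-many merge above can be memory-hungry, because every df1 row is paired with every matching df2 row before the date filter runs. A sketch of an alternative (not part of the original answers): compute a per-product running total once, then use pd.merge_asof to pick up the latest total on or before each Date1. The names cum and out are mine; products absent from df2 would come back as NaN.

```python
import pandas as pd

df1 = pd.DataFrame({'Product1': ['A', 'A', 'B'],
                    'Date1': pd.to_datetime(['2019-02-01', '2019-12-15', '2019-03-01'])})
df2 = pd.DataFrame({'Product2': ['A', 'A', 'A', 'C', 'B', 'B'],
                    'Date2': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-06-01',
                                             '2019-06-30', '2019-01-31', '2019-02-15']),
                    'Payment': [100, 150, 100, 500, 25, 25]})

# Running total of each product's payments, in date order
cum = df2.sort_values('Date2').copy()
cum['Total'] = cum.groupby('Product2')['Payment'].cumsum()

# For each (product, Date1) in df1, take the latest running total on or
# before Date1; merge_asof matches backward and allows exact dates by default
out = pd.merge_asof(df1.sort_values('Date1'), cum,
                    left_on='Date1', right_on='Date2',
                    left_by='Product1', right_by='Product2')
out = out[['Product1', 'Date1', 'Total']]
```

Both inputs must be sorted on the date keys for merge_asof, hence the sort_values calls; the result comes back ordered by Date1 rather than by df1's original index.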

You can join the tables up front, and then keep only the rows that meet your criteria. A solution can look something like this:

import pandas as pd
df1 = pd.DataFrame({'Product1': ['A', 'A', 'B'], 
                    'Date1': ['2019-02-01', '2019-12-15', '2019-03-01']})
df1['Date1'] = pd.to_datetime(df1['Date1'])

df2 = pd.DataFrame({'Product2': ['A', 'A', 'A', 'C', 'B', 'B'], 
                    'Date2': ['2019-01-01', '2019-02-01', '2019-06-01', '2019-06-30', '2019-01-31', '2019-02-15'],
                    'Payment': [100, 150, 100, 500, 25, 25]})
df2['Date2'] = pd.to_datetime(df2['Date2'])

#Join according to the products in both tables
temp = pd.merge(df1, df2, how='left', left_on='Product1', right_on='Product2')
#only keep those rows that meet the date criteria
temp = temp[temp['Date2'] <= temp['Date1']]
#aggregate results according to product and date.
out = temp.groupby(['Product1', 'Date1']).agg({'Payment':'sum'})
#Optional: If you wish to get back the columns normally
out = out.reset_index()

print(out.head())
  Product1      Date1  Payment
0        A 2019-02-01      250
1        A 2019-12-15      350
2        B 2019-03-01       50
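One caveat worth noting about the snippet above: the Date2 <= Date1 filter also drops any df1 row whose product has no qualifying payment at all, since its left-joined Date2 is NaT and the comparison is False. If such rows should survive with a total of 0, one sketch (my own variant, with the hypothetical names totals and out) is to merge the aggregate back onto df1:

```python
import pandas as pd

df1 = pd.DataFrame({'Product1': ['A', 'A', 'B'],
                    'Date1': pd.to_datetime(['2019-02-01', '2019-12-15', '2019-03-01'])})
df2 = pd.DataFrame({'Product2': ['A', 'A', 'A', 'C', 'B', 'B'],
                    'Date2': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-06-01',
                                             '2019-06-30', '2019-01-31', '2019-02-15']),
                    'Payment': [100, 150, 100, 500, 25, 25]})

temp = pd.merge(df1, df2, how='left', left_on='Product1', right_on='Product2')
# Aggregate only the rows that satisfy the date condition
totals = temp[temp['Date2'] <= temp['Date1']] \
    .groupby(['Product1', 'Date1'], as_index=False)['Payment'].sum()
# Left-merge back onto df1 so products with no qualifying payments keep 0
out = df1.merge(totals, on=['Product1', 'Date1'], how='left').fillna({'Payment': 0})
```

In this sample every product in df1 has payments, so the result matches the one above; the difference only shows on data where some df1 product has none.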

Here is an approach that mirrors the one in the OP. Rather than performing a one-to-many merge, it loops through each row of df1 and performs the check and sum for that row. It is likely a bit faster than the original, since it uses a list comprehension instead of growing a list with append, though still slower than a merge. The upside is that it uses less memory.

df1['Payment'] = \
   [df2.loc[(df1.iloc[i, 0] == df2.Product2) & (df1.iloc[i, 1] >= df2.Date2),
            'Payment'].sum()
    for i in range(df1.shape[0])]

This returns the desired result

df1
  Product1      Date1  Payment
0        A 2019-02-01      250
1        A 2019-12-15      350
2        B 2019-03-01       50
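If the merge-based answers use too much memory but a Python-level scan of df2 per df1 row is too slow, a middle ground (my own sketch, not from the original answers; lookup and total_up_to are hypothetical names) is to pre-compute sorted dates and cumulative payments per product once, then binary-search each df1 row with numpy.searchsorted:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Product1': ['A', 'A', 'B'],
                    'Date1': pd.to_datetime(['2019-02-01', '2019-12-15', '2019-03-01'])})
df2 = pd.DataFrame({'Product2': ['A', 'A', 'A', 'C', 'B', 'B'],
                    'Date2': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-06-01',
                                             '2019-06-30', '2019-01-31', '2019-02-15']),
                    'Payment': [100, 150, 100, 500, 25, 25]})

# One pass over df2: per product, dates ascending plus the running total
# of payments up to each date
lookup = {prod: (g['Date2'].to_numpy(), g['Payment'].cumsum().to_numpy())
          for prod, g in df2.sort_values('Date2').groupby('Product2')}

def total_up_to(prod, date):
    """Sum of payments for prod on or before date; 0 if there are none."""
    if prod not in lookup:
        return 0
    dates, csum = lookup[prod]
    # side='right' keeps payments dated exactly on `date`
    idx = np.searchsorted(dates, np.datetime64(date), side='right')
    return csum[idx - 1] if idx else 0

df1['Total'] = [total_up_to(p, d) for p, d in zip(df1['Product1'], df1['Date1'])]
```

Each df1 row then costs one dictionary lookup plus a binary search instead of a full scan of df2, while the only extra memory is one cumulative-sum array per product.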
