I have a number of debit and credit rows in a pandas dataframe (some sample data below):
+----------+-------+--------------+--------+
| Date     | Party | Debit/Credit | Amount |
+----------+-------+--------------+--------+
| 9/1/2020 | Wells | Debit        |      4 |
| 9/1/2020 | Wells | Credit       |     -4 |
| 9/1/2020 | Wells | Debit        |      4 |
| 9/1/2020 | Wells | Debit        |      4 |
| 9/2/2020 | BOA   | Credit       |     -4 |
| 9/2/2020 | BOA   | Debit        |      4 |
| 9/3/2020 | Chase | Debit        |      4 |
+----------+-------+--------------+--------+
I am trying to identify matching pairs of Date/Party and amounts where they offset. For example, on 9/1 you can see offsetting debit and credit transactions with Wells.
What I have tried to do is create a separate Debit dataframe and Credit dataframe, and then merge the two on Date/Party.
import pandas as pd

df = pd.DataFrame({'Date': ['9/1/2020', '9/1/2020', '9/1/2020', '9/1/2020', '9/2/2020', '9/2/2020', '9/3/2020'],
                   'Party': ['Wells', 'Wells', 'Wells', 'Wells', 'BOA', 'BOA', 'Chase'],
                   'Debit/Credit': ['Debit', 'Credit', 'Debit', 'Debit', 'Credit', 'Debit', 'Debit'],
                   'Amount': [4, -4, 4, 4, -4, 4, 4]})
debit_df = df.loc[df['Debit/Credit'] == 'Debit']
credit_df = df.loc[df['Debit/Credit'] == 'Credit']
offset_df = debit_df.merge(credit_df, on=['Date', 'Party'])
matching_trans = offset_df.loc[offset_df['Amount_x'] == abs(offset_df['Amount_y'])]
The problem with this approach is that I obviously get a Cartesian product wherever there are multiple similar Wells transactions. Is there a way to identify just the matching pairs for Wells (i.e. Debit 4, Credit -4), only as many times as they actually occur? My data is much larger, but in this example you would return only 1 result for Wells in the final matching_trans dataframe.
If you only need the number of times this happens you can compare the count of matching instances. First take the counts of similar amounts for each Date/Party for Debit and Credit:
debit_df = df.loc[df['Debit/Credit'] == 'Debit'].groupby(['Date', 'Party', 'Amount']).count().reset_index()
credit_df = df.loc[df['Debit/Credit'] == 'Credit'].groupby(['Date', 'Party', 'Amount']).count().reset_index()
Then negate the amount on one side (here the credits), so that both sides can be matched on the same Amount value:
credit_df.rename(columns={'Amount':'Credit_Amount'}, inplace=True)
credit_df['Amount'] = -credit_df['Credit_Amount']
Finally, match the two dfs on Date, Party and Amount, drop the rows with NAs (debits with no matching credit) and find the number of offsets:
matching_trans = debit_df.merge(credit_df, on=['Date', 'Party', 'Amount'], how='left').dropna(axis=0)
matching_trans.rename(columns={'Amount':'Debit_Amount', 'Debit/Credit_x':'Debit_count',
'Debit/Credit_y':'Credit_count'}, inplace=True)
# row-wise minimum of the two counts = number of debit/credit pairs that offset
matching_trans['offset_count'] = matching_trans[['Debit_count', 'Credit_count']].min(axis=1)
The 'offset_count' column gives the number of offsets you have for each Date/Party/Amount combination.
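If the matching rows themselves are needed rather than just the counts, one possible sketch is to number repeated transactions within each group using groupby().cumcount() and include that number in the merge key, so the n-th debit can only pair with the n-th credit. The helper columns row_id, abs_amount and occurrence are names made up for this illustration; they are not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['9/1/2020', '9/1/2020', '9/1/2020', '9/1/2020',
             '9/2/2020', '9/2/2020', '9/3/2020'],
    'Party': ['Wells', 'Wells', 'Wells', 'Wells', 'BOA', 'BOA', 'Chase'],
    'Debit/Credit': ['Debit', 'Credit', 'Debit', 'Debit',
                     'Credit', 'Debit', 'Debit'],
    'Amount': [4, -4, 4, 4, -4, 4, 4],
})

# Hypothetical helper columns, added only for this sketch.
work = df.reset_index().rename(columns={'index': 'row_id'})
work['abs_amount'] = work['Amount'].abs()
keys = ['Date', 'Party', 'abs_amount']

# Number the repeats within each (side, Date, Party, amount) group: 0, 1, 2, ...
work['occurrence'] = work.groupby(['Debit/Credit'] + keys).cumcount()

debits = work[work['Debit/Credit'] == 'Debit']
credits = work[work['Debit/Credit'] == 'Credit']

# Merging on the occurrence number as well prevents the Cartesian product:
# each debit can match at most one credit.
pairs = debits.merge(credits, on=keys + ['occurrence'],
                     suffixes=('_debit', '_credit'))
print(pairs[['Date', 'Party', 'row_id_debit', 'row_id_credit',
             'Amount_debit', 'Amount_credit']])
```

For the sample data this pairs one Wells debit with the single Wells credit, plus the BOA debit and credit; the two remaining Wells debits and the Chase debit stay unmatched. Which of several identical debits gets paired is arbitrary (here, the first one encountered).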
Here is an approach that identifies the matching pairs themselves. It's long but not complicated. First build one defaultdict for debits and one for credits.
from collections import defaultdict
from io import StringIO
import pandas as pd
# create data frame
data = '''
Date Party Debit_Credit Amount
9/1/2020 Wells Debit 4
9/1/2020 Wells Credit -4
9/1/2020 Wells Debit 4
9/1/2020 Wells Debit 4
9/2/2020 BOA Credit -4
9/2/2020 BOA Debit 4
9/3/2020 Chase Debit 4
'''
# raw string avoids an invalid-escape warning for '\s'
df = pd.read_csv(StringIO(data), sep=r'\s+',
                 engine='python', parse_dates=['Date'])
df = df.reset_index().rename(columns={'index': 'seq_num'})
Next step:
# make a default dictionary for debits
# key => (Date, Party, Amount)
# value => list of seq_num
# same for credits (except use -1 * Amount, so the keys line up)
debits = defaultdict(list)
credits = defaultdict(list)
for row in df.itertuples():
    if row.Debit_Credit == 'Debit':
        key = (row.Date, row.Party, row.Amount)
        debits[key].append(row.seq_num)
    elif row.Debit_Credit == 'Credit':
        key = (row.Date, row.Party, (-1) * row.Amount)
        credits[key].append(row.seq_num)
    else:
        continue  # can't get here!
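Now iterate over the debits dict. If the key also exists in the credits dict, we have found matching pairs: move one debit and one credit sequence number at a time to an 'offsets' dict until either side runs out, so a key with several debits and several credits yields several pairs.
offsets = defaultdict(list)
for key, value in debits.items():
    # is this key also in credits?
    if key in credits:
        # pair off debits and credits until one list is empty
        while value and credits[key]:
            print(key, 'found offset!')
            debit_seq_num = value.pop()
            credit_seq_num = credits[key].pop()
            offsets[key].append((debit_seq_num, credit_seq_num))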
Finally, we can print a little report, by iterating over each dict:
# print report (keys whose lists are now empty are skipped)
print('debits')
for key, value in debits.items():
    if value:
        print(' ', key, value)
print('credits')
for key, value in credits.items():
    if value:
        print(' ', key, value)
print('offsets')
for key, value in offsets.items():
    if value:
        print(' ', key, value)
debits
  (Timestamp('2020-09-01 00:00:00'), 'Wells', 4) [0, 2]
  (Timestamp('2020-09-03 00:00:00'), 'Chase', 4) [6]
credits
offsets
  (Timestamp('2020-09-01 00:00:00'), 'Wells', 4) [(3, 1)]
  (Timestamp('2020-09-02 00:00:00'), 'BOA', 4) [(5, 4)]
The offsets dict gives the pairs of sequence numbers that offset each other. Note that the union of debits, credits and offsets is the same as the original data frame (nothing is double-counted and nothing is lost).
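The claim that nothing is double-counted or lost can be checked mechanically. The following is a minimal sketch that wraps the same idea in a function, here called pair_offsets (a hypothetical name, not something from pandas), and verifies that matched and unmatched sequence numbers together cover every row exactly once. It assumes every row is either 'Debit' or 'Credit' and treats anything that is not 'Debit' as a credit.

```python
from collections import defaultdict

import pandas as pd


def pair_offsets(df):
    """Return ([(debit_seq, credit_seq), ...], [unmatched seq_nums])."""
    debits, credits = defaultdict(list), defaultdict(list)
    for row in df.itertuples():
        # Assumption: any row that is not a 'Debit' is a 'Credit'.
        side = debits if row.Debit_Credit == 'Debit' else credits
        side[(row.Date, row.Party, abs(row.Amount))].append(row.seq_num)

    pairs = []
    for key, debit_ids in debits.items():
        credit_ids = credits.get(key, [])
        # Pair off one debit with one credit until either side is exhausted.
        while debit_ids and credit_ids:
            pairs.append((debit_ids.pop(), credit_ids.pop()))

    leftovers = list(debits.values()) + list(credits.values())
    unmatched = [seq for ids in leftovers for seq in ids]
    return pairs, unmatched


df = pd.DataFrame({
    'Date': ['9/1/2020'] * 4 + ['9/2/2020'] * 2 + ['9/3/2020'],
    'Party': ['Wells'] * 4 + ['BOA'] * 2 + ['Chase'],
    'Debit_Credit': ['Debit', 'Credit', 'Debit', 'Debit',
                     'Credit', 'Debit', 'Debit'],
    'Amount': [4, -4, 4, 4, -4, 4, 4],
})
df = df.reset_index().rename(columns={'index': 'seq_num'})

pairs, unmatched = pair_offsets(df)

# Every seq_num should appear exactly once across pairs and unmatched rows.
seen = sorted([seq for pair in pairs for seq in pair] + unmatched)
print(pairs, unmatched, seen == list(df['seq_num']))
```

Because the lists are consumed with pop(), the last debit recorded for a key is paired first; if a specific pairing order matters (for example earliest first), the lists would need to be consumed from the front instead.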