Lining up debits and credits in pandas?

I have a number of debit and credit rows in a pandas dataframe (some sample data below):

+----------+-------+--------------+--------+
|   Date   | Party | Debit/Credit | Amount |
+----------+-------+--------------+--------+
| 9/1/2020 | Wells | Debit        |      4 |
| 9/1/2020 | Wells | Credit       |     -4 |
| 9/1/2020 | Wells | Debit        |      4 |
| 9/1/2020 | Wells | Debit        |      4 |
| 9/2/2020 | BOA   | Credit       |     -4 |
| 9/2/2020 | BOA   | Debit        |      4 |
| 9/3/2020 | Chase | Debit        |      4 |
+----------+-------+--------------+--------+

I am trying to identify matching pairs of Date/Party and amounts where they offset. For example, on 9/1 you can see offsetting debit and credit transactions with Wells.

What I have tried to do is create a separate Debit dataframe and Credit dataframe, and then merge the two on Date/Party.

import pandas as pd

df = pd.DataFrame({'Date': ['9/1/2020','9/1/2020', '9/1/2020', '9/1/2020', '9/2/2020', '9/2/2020', '9/3/2020'],
                  'Party': ['Wells', 'Wells', 'Wells', 'Wells', 'BOA', 'BOA', 'Chase'],
                  'Debit/Credit': ['Debit', 'Credit', 'Debit', 'Debit', 'Credit', 'Debit', 'Debit'],
                  'Amount': [4, -4, 4, 4, -4, 4, 4]})
# split into debits and credits, then join them on Date/Party
debit_df = df.loc[df['Debit/Credit'] == 'Debit']
credit_df = df.loc[df['Debit/Credit'] == 'Credit']
offset_df = debit_df.merge(credit_df, on=['Date', 'Party'])
# keep the rows where the debit amount equals the absolute credit amount
matching_trans = offset_df.loc[offset_df['Amount_x'] == offset_df['Amount_y'].abs()]

The problem with this approach is that the merge produces a Cartesian product wherever there are multiple similar transactions, as with Wells. Is there a way to identify just the matching pairs for Wells (i.e. Debit 4, Credit -4), only as many times as they actually occur? My data is much larger, but in this example Wells should return only 1 result in the final matching_trans dataframe.

If you only need the number of times this happens you can compare the count of matching instances. First take the counts of similar amounts for each Date/Party for Debit and Credit:

debit_df = df.loc[df['Debit/Credit'] == 'Debit'].groupby(['Date', 'Party', 'Amount']).count().reset_index()
credit_df = df.loc[df['Debit/Credit'] == 'Credit'].groupby(['Date', 'Party', 'Amount']).count().reset_index()

Then negate the credit amounts so that they can be matched against the debit amounts:

credit_df.rename(columns={'Amount':'Credit_Amount'}, inplace=True)
credit_df['Amount'] = -credit_df['Credit_Amount']

In the end match the two dfs on Date, Party and Amount, drop the NAs and find the number of offsets:

matching_trans = debit_df.merge(credit_df, on=['Date', 'Party', 'Amount'], how='left').dropna(axis=0)
matching_trans.rename(columns={'Amount':'Debit_Amount', 'Debit/Credit_x':'Debit_count',
                               'Debit/Credit_y':'Credit_count'}, inplace=True)
matching_trans['offset_count'] = matching_trans[['Credit_count', 'Debit_count']].min(axis=1)

The 'offset_count' column gives the number of offsetting pairs for each Date/Party/Amount combination.
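
If the individual pairs are needed rather than just the counts, one possible sketch (a different technique from the count comparison above) numbers the repeated rows within each Date/Party/type/amount group with groupby(...).cumcount() and merges on that number, so the n-th debit can only pair with the n-th credit. The column names abs_amount and occurrence are helper names introduced here for illustration, and the sketch assumes debits are positive and credits negative as in the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['9/1/2020', '9/1/2020', '9/1/2020', '9/1/2020', '9/2/2020', '9/2/2020', '9/3/2020'],
    'Party': ['Wells', 'Wells', 'Wells', 'Wells', 'BOA', 'BOA', 'Chase'],
    'Debit/Credit': ['Debit', 'Credit', 'Debit', 'Debit', 'Credit', 'Debit', 'Debit'],
    'Amount': [4, -4, 4, 4, -4, 4, 4],
})

# Helper columns (hypothetical names): the absolute amount, and a running
# number for each repeated row within Date/Party/type/amount.
df['abs_amount'] = df['Amount'].abs()
df['occurrence'] = df.groupby(['Date', 'Party', 'Debit/Credit', 'abs_amount']).cumcount()

# reset_index() keeps the original row label in a column named 'index'
debit_df = df.loc[df['Debit/Credit'] == 'Debit'].reset_index()
credit_df = df.loc[df['Debit/Credit'] == 'Credit'].reset_index()

# The inner merge on the occurrence number pairs the n-th debit with the
# n-th credit, so repeated rows no longer multiply out.
matching_trans = debit_df.merge(
    credit_df,
    on=['Date', 'Party', 'abs_amount', 'occurrence'],
    suffixes=('_debit', '_credit'),
)
print(matching_trans[['Date', 'Party', 'abs_amount', 'index_debit', 'index_credit']])
```

On the sample data this leaves one Wells pair and one BOA pair; the two extra Wells debits and the Chase debit have no credit with the same occurrence number, so they drop out of the inner merge.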

Here is an approach that identifies the matching pairs. It's long but not complicated. Make a defaultdict for debits and another for credits.

  • Key is Date + Party + Amount (change the sign of Amount for credits).
  • Value is a list of unique IDs (I called it seq_num, but it's just the original index).

from collections import defaultdict
from io import StringIO
import pandas as pd

# create data frame
data = '''
   Date    Party  Debit_Credit  Amount 
 9/1/2020  Wells  Debit              4 
 9/1/2020  Wells  Credit            -4 
 9/1/2020  Wells  Debit              4 
 9/1/2020  Wells  Debit              4 
 9/2/2020  BOA    Credit            -4 
 9/2/2020  BOA    Debit              4 
 9/3/2020  Chase  Debit              4 
'''
# raw string so that '\s' is not treated as an (invalid) string escape
df = pd.read_csv(StringIO(data), sep=r'\s+',
                 engine='python', parse_dates=['Date'])
df = df.reset_index().rename(columns={'index': 'seq_num'})

Next step:

# make a default dictionary for debits
# key => (Date + Party + Amount)
# value => list of seq_num
# same for credits (except use -1 * Amount)

debits = defaultdict(list)
credits = defaultdict(list)

for row in df.itertuples():
    if row.Debit_Credit == 'Debit':
        key = (row.Date, row.Party, row.Amount)
        debits[key].append(row.seq_num)
    elif row.Debit_Credit == 'Credit':
        key = (row.Date, row.Party, (-1) * row.Amount)
        credits[key].append(row.seq_num)
    else:
        continue # can't get here!

Now iterate over the debits dict. If the key also exists in the credits dict, we have found matching pairs: move the sequence numbers to an 'offsets' dict, one pair at a time, until either list runs out.

offsets = defaultdict(list)

for key, value in debits.items():
    # is this key also in credits?
    if key in credits:
        # keep pairing while both sides still have unmatched rows
        while value and credits[key]:
            print(key, 'found offset!')
            debit_seq_num = value.pop()
            credit_seq_num = credits[key].pop()
            offsets[key].append((debit_seq_num, credit_seq_num))

Finally, we can print a little report, by iterating over each dict:

# print report

print('debits')
for key, value in debits.items():
    if value:
        print('    ', key, value)
        
print('credits')
for key, value in credits.items():
    if value:
        print('    ', key, value)

print('offsets')
for key, value in offsets.items():
    if value:
        print('    ', key, value)

debits
     (Timestamp('2020-09-01 00:00:00'), 'Wells', 4) [0, 2]
     (Timestamp('2020-09-03 00:00:00'), 'Chase', 4) [6]
credits
offsets
     (Timestamp('2020-09-01 00:00:00'), 'Wells', 4) [(3, 1)]
     (Timestamp('2020-09-02 00:00:00'), 'BOA', 4) [(5, 4)]

The offsets dict gives the pairs of sequence numbers that offset each other. Note that the union of debits, credits and offsets is the same as the original data frame (we didn't double-count, and we didn't lose anything).
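
That last claim can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper pair_offsets (not part of the answer above) that condenses the same dictionary-based steps into one function, pairs rows until either list for a key is exhausted, and then confirms that every original row label ends up in exactly one of the three groups.

```python
from collections import defaultdict

import pandas as pd


def pair_offsets(df):
    """Return (unmatched_debits, unmatched_credits, offsets) as lists of row labels.

    pair_offsets is a hypothetical helper name; it expects the columns
    Date, Party, Debit_Credit and Amount used in the example above.
    """
    debits, credits = defaultdict(list), defaultdict(list)
    for row in df.itertuples():
        # abs() makes a debit of 4 and a credit of -4 share the same key
        key = (row.Date, row.Party, abs(row.Amount))
        target = debits if row.Debit_Credit == 'Debit' else credits
        target[key].append(row.Index)

    offsets = []
    for key, debit_ids in debits.items():
        credit_ids = credits.get(key, [])
        # pair rows one-to-one until either side has nothing left
        while debit_ids and credit_ids:
            offsets.append((debit_ids.pop(), credit_ids.pop()))

    unmatched_debits = [i for ids in debits.values() for i in ids]
    unmatched_credits = [i for ids in credits.values() for i in ids]
    return unmatched_debits, unmatched_credits, offsets


df = pd.DataFrame({
    'Date': ['9/1/2020', '9/1/2020', '9/1/2020', '9/1/2020', '9/2/2020', '9/2/2020', '9/3/2020'],
    'Party': ['Wells', 'Wells', 'Wells', 'Wells', 'BOA', 'BOA', 'Chase'],
    'Debit_Credit': ['Debit', 'Credit', 'Debit', 'Debit', 'Credit', 'Debit', 'Debit'],
    'Amount': [4, -4, 4, 4, -4, 4, 4],
})

unmatched_debits, unmatched_credits, offsets = pair_offsets(df)

# Every row label must appear exactly once across the three groups
seen = unmatched_debits + unmatched_credits + [i for pair in offsets for i in pair]
assert sorted(seen) == list(df.index)
print(len(offsets), 'offsetting pairs,', len(unmatched_debits) + len(unmatched_credits), 'unmatched rows')
```

On the sample data the helper pairs row 3 with row 1 (Wells) and row 5 with row 4 (BOA), leaving rows 0, 2 and 6 as unmatched debits.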
