
Finding one to many matches in two Pandas Dataframes

I am attempting to put together a generic matching process for financial data. The goal is to take one set of data with larger transactions and match it to a set of data with smaller transactions. Some matches are one-to-many, others are one-to-one. Occasionally the relationship is reversed, so part of the approach is to feed the mismatches back in inverse order to capture those possible matches.

I have three different modules I have created to iterate across each other to complete the work, but I am not getting consistent results. I see possible matches in my data that should be picked up but are not.

There are no clear matching criteria either, so my assumption is that if I put the datasets in date order and look for matching values, I should take the first match, since it should be the closest in time.

I am using Pandas and Itertools, but perhaps not in the ideal way. Any help to get consistent matches would be appreciated.

Data examples:

Large Transaction Size:

AID    AIssue Date    AAmount
1508     3/14/2018   -560
1506     3/27/2018    -35
1500     4/25/2018   5000

Small Transaction Size:
BID     BIssue Date   BAmount
1063     3/6/2018     -300
1062     3/6/2018     -260
839      3/22/2018     -35
423      4/24/2018    5000

Expected Results
AID     AIssue Date   AAmount    BID     BIssue Date   BAmount
1508     3/14/2018     -560      1063      3/6/2018     -300
1508     3/14/2018     -560      1062      3/6/2018     -260
1506     3/27/2018      -35       839      3/22/2018     -35
1500     4/25/2018     5000       423      4/24/2018    5000

but I usually get
AID     AIssue Date   AAmount    BID     BIssue Date   BAmount
1508     3/14/2018     -560      1063      3/6/2018     -300
1508     3/14/2018     -560      1062      3/6/2018     -260
1506     3/27/2018      -35       839      3/22/2018     -35

with the 5000 not matching. This is only one example; whether the amount is positive or negative does not appear to be a factor when I look at the larger data set.

When reviewing the unmatched results from each, I see at least one $5000 transaction I would expect to be a 1-1 match and it is not in the results.

import itertools
import pandas as pd

def matches(iterable):
    s = list(iterable)
    #Cap the combination size to avoid memory overrun on large datasets
    #Note: range(5) yields combinations of 0 to 4 items
    s = list(itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(5)))

    return [list(elem) for elem in s]

def one_to_many(dfL, dfS, dID = 0, dDT = 1, dVal = 2):   
    #dfL = dataset with larger values
    #dfS = dataset with smaller values
    #dID = column index of ID record
    #dDT = column index of date record
    #dVal = column index of dollar value record

    S = dfS[dfS.columns[dID]].values.tolist()
    S_amount = dfS[dfS.columns[dVal]].values.tolist()

    S = matches(S)
    S_amount = matches(S_amount)

    #get ID of first large record, the ID to be matched in this module
    L = dfL[dfL.columns[dID]].iloc[0]

    #get Value of first large record, this value will be matching criteria
    L_amount = dfL[dfL.columns[dVal]].iloc[0]

    count_of_sets = len(S)

    for a in range(0,count_of_sets):

        list_of_items = S[a]
        list_of_values = S_amount[a]

        if round(sum(list_of_values),2) == round(L_amount,2):
            break

    if round(sum(list_of_values),2) == round(L_amount,2):
        retVal = list_of_items
    else:
        retVal = [-1]

    return retVal

def iterate_one_to_many(dfLarge, dfSmall, dID = 0, dDT = 1, dVal = 2):
    #dfL = dataset with larger values
    #dfS = dataset with smaller values
    #dID = column index of ID record
    #dDT = column index of date record
    #dVal = column index of dollar value record

    #returns a list of dataframes [paired matches, unmatched from dfL, unmatched from dfS]

    dfLarge = dfLarge.set_index(dfLarge.columns[dID]).sort_values([dfLarge.columns[dDT], dfLarge.columns[dVal]]).reset_index()
    dfSmall = dfSmall.set_index(dfSmall.columns[dID]).sort_values([dfSmall.columns[dDT], dfSmall.columns[dVal]]).reset_index()

    end_row = len(dfLarge.columns[dID]) - 1

    matches_master = pd.DataFrame(data = None, columns = dfLarge.columns.append(dfSmall.columns))

    for lg in range(0,end_row):

        sm_match_id = one_to_many(dfLarge, dfSmall)
        lg_match_id = dfLarge[dfLarge.columns[dID]][lg]

        if sm_match_id != [-1]:

            end_of_matches = len(sm_match_id)

            for sm in range(0, end_of_matches):
                if sm == 0:
                    sm_match = dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy()
                    dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()
                else:
                    sm_match = sm_match.append(dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy())
                    dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()

            lg_match = dfLarge.loc[dfLarge[dfLarge.columns[dID]] == lg_match_id].copy()

            sm_match['Match'] = lg
            lg_match['Match'] = lg

            sm_match.set_index('Match', inplace=True)
            lg_match.set_index('Match', inplace=True)

            matches = lg_match.join(sm_match, how='left')
            matches_master = matches_master.append(matches)

            dfLarge = dfLarge.loc[dfLarge[dfLarge.columns[dID]] != lg_match_id].copy()

    return [matches_master, dfLarge, dfSmall]

IIUC, the goal is to match each transaction in the small DataFrame to the transaction in the large DataFrame that falls on the same date or is the closest one in the future. You can use pandas.merge_asof() to perform a match based on the closest future date.

import pandas as pd
# Ensure your dates are datetime
df_large['AIssue Date'] = pd.to_datetime(df_large['AIssue Date'])
df_small['BIssue Date'] = pd.to_datetime(df_small['BIssue Date'])

merged = pd.merge_asof(df_small, df_large, left_on='BIssue Date', 
                       right_on='AIssue Date', direction='forward')

merged is now:

    BID  BAmount BIssue Date   AID  AAmount AIssue Date
0  1063     -300  2018-03-06  1508     -560  2018-03-14
1  1062     -260  2018-03-06  1508     -560  2018-03-14
2   839      -35  2018-03-22  1506      -35  2018-03-27
3   423     5000  2018-04-24  1500     5000  2018-04-25
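Since merge_asof() pairs rows by date only and never looks at the amounts, it may be worth checking afterwards that the small amounts assigned to each large transaction actually add up. The following is a minimal sketch using the sample data from the question; the `check` frame and its column names (`small_total`, `large_amount`, `balanced`) are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df_large = pd.DataFrame({
    'AID': [1508, 1506, 1500],
    'AIssue Date': pd.to_datetime(['3/14/2018', '3/27/2018', '4/25/2018']),
    'AAmount': [-560, -35, 5000],
})
df_small = pd.DataFrame({
    'BID': [1063, 1062, 839, 423],
    'BIssue Date': pd.to_datetime(['3/6/2018', '3/6/2018', '3/22/2018', '4/24/2018']),
    'BAmount': [-300, -260, -35, 5000],
})

# merge_asof requires both frames to be sorted by their key columns.
merged = pd.merge_asof(df_small.sort_values('BIssue Date'),
                       df_large.sort_values('AIssue Date'),
                       left_on='BIssue Date', right_on='AIssue Date',
                       direction='forward')

# Total the small amounts per large transaction and compare with the large amount.
check = merged.groupby('AID').agg(small_total=('BAmount', 'sum'),
                                  large_amount=('AAmount', 'first'))
check['balanced'] = check['small_total'].round(2) == check['large_amount'].round(2)
print(check)
```

Any row where `balanced` is False would point to a date-based pairing whose amounts do not reconcile and that needs a closer look.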

If you expect some transactions to have no match at all, you can also pass a tolerance to restrict matches to a smaller window, so that a missing value in one DataFrame doesn't throw everything off.
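As a rough illustration of the tolerance option, here is a sketch on the question's sample data. The window sizes (10 days and 6 days) are arbitrary values picked for the demonstration, not a recommendation.

```python
import pandas as pd

# Sample data copied from the question.
df_large = pd.DataFrame({
    'AID': [1508, 1506, 1500],
    'AIssue Date': pd.to_datetime(['3/14/2018', '3/27/2018', '4/25/2018']),
    'AAmount': [-560, -35, 5000],
})
df_small = pd.DataFrame({
    'BID': [1063, 1062, 839, 423],
    'BIssue Date': pd.to_datetime(['3/6/2018', '3/6/2018', '3/22/2018', '4/24/2018']),
    'BAmount': [-300, -260, -35, 5000],
})

left = df_small.sort_values('BIssue Date')
right = df_large.sort_values('AIssue Date')

# A 10-day window is wide enough for every sample row to find a partner.
merged_wide = pd.merge_asof(left, right,
                            left_on='BIssue Date', right_on='AIssue Date',
                            direction='forward',
                            tolerance=pd.Timedelta(days=10))

# A 6-day window leaves the two 3/6 rows unmatched (3/14 is 8 days away),
# so their A-side columns come back as NaN/NaT.
merged_tight = pd.merge_asof(left, right,
                             left_on='BIssue Date', right_on='AIssue Date',
                             direction='forward',
                             tolerance=pd.Timedelta(days=6))
print(merged_tight)
```

Rows with a missing AID in the result are the ones that found no large transaction inside the window, which makes them easy to filter out for a second pass.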

In my function iterate_one_to_many, I was counting the rows incorrectly. I needed to replace

end_row = len(dfLarge.columns[dID]) - 1

with

end_row = len(dfLarge.index)
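To see why the original expression miscounts, note that dfLarge.columns[dID] is the column name, so len() returns the number of characters in that name rather than the number of rows. A small sketch with the question's sample data (the names `wrong_end_row` and `right_end_row` are only for illustration):

```python
import pandas as pd

# Large-transaction sample data from the question.
df_large = pd.DataFrame({
    'AID': [1508, 1506, 1500],
    'AIssue Date': ['3/14/2018', '3/27/2018', '4/25/2018'],
    'AAmount': [-560, -35, 5000],
})
d_id = 0  # column index of the ID record

# Original expression: length of the column NAME ('AID' has 3 characters), minus 1.
wrong_end_row = len(df_large.columns[d_id]) - 1

# Corrected expression: the number of rows in the frame.
right_end_row = len(df_large.index)

print(wrong_end_row, right_end_row)
```

With an ID column named 'AID', the loop ran only twice regardless of how many rows there were, which is consistent with the third large record (the 5000 transaction) never being tried in the sample above.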
