I am attempting to put together a generic matching process for financial data. The goal is to take one set of data with larger transactions and match it to a set of data with smaller transactions. Some are one to many, others are one to one. There are a few times where it may be reversed and part of the approach is to feed back the miss matches in inverse order to capture those possible matches.
I have three different modules I have created to iterate across each other to complete the work, but I am not getting consistent results. I see possible matches in my data that should be picked up but are not.
There is no clear matching criteria either, so the assumption is if I put the datasets in date order, and look for matching values, I want to take the first match since it should be closer to the same timeframe.
I am using Pandas and Itertools, but maybe not in the ideal format. Any help to get consistent matches would be appreciated.
Data examples:
Large Transaction Size:
AID AIssue Date AAmount
1508 3/14/2018 -560
1506 3/27/2018 -35
1500 4/25/2018 5000
Small Transaction Size:
BID BIssue Date BAmount
1063 3/6/2018 -300
1062 3/6/2018 -260
839 3/22/2018 -35
423 4/24/2018 5000
Expected Results
AID AIssue Date AAMount BID BIssue Date BAmount
1508 3/14/2018 -560 1063 3/6/2018 -300
1508 3/14/2018 -560 1062 3/6/2018 -260
1506 3/27/2018 -35 839 3/22/2018 -35
1500 4/25/2018 5000 423 4/24/2018 5000
but I usually get
AID AIssue Date AAMount BID BIssue Date BAmount
1508 3/14/2018 -560 1063 3/6/2018 -300
1508 3/14/2018 -560 1062 3/6/2018 -260
1506 3/27/2018 -35 839 3/22/2018 -35
with the 5000 not matching. And this is one example, but positive negative does not appear to be the factor when looking at the larger data set.
When reviewing the unmatched results from each, I see at least one $5000 transaction I would expect to be a 1-1 match and it is not in the results.
def matches(iterable):
s = list(iterable)
#Only going to 5 matches to avoid memory overrun on large datasets
s = list(itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(5)))
return [list(elem) for elem in s]
def one_to_many(dfL, dfS, dID = 0, dDT = 1, dVal = 2):
#dfL = dataset with larger values
#dfS = dataset with smaller values
#dID = column index of ID record
#dDT = column index of date record
#dVal = column index of dollar value record
S = dfS[dfS.columns[dID]].values.tolist()
S_amount = dfS[dfS.columns[dVal]].values.tolist()
S = matches(S)
S_amount = matches(S_amount)
#get ID of first large record, the ID to be matched in this module
L = dfL[dfL.columns[dID]].iloc[0]
#get Value of first large record, this value will be matching criteria
L_amount = dfL[dfL.columns[dVal]].iloc[0]
count_of_sets = len(S)
for a in range(0,count_of_sets):
list_of_items = S[a]
list_of_values = S_amount[a]
if round(sum(list_of_values),2) == round(L_amount,2):
break
if round(sum(list_of_values),2) == round(L_amount,2):
retVal = list_of_items
else:
retVal = [-1]
return retVal
def iterate_one_to_many(dfLarge, dfSmall, dID = 0, dDT = 1, dVal = 2):
#dfL = dataset with larger values
#dfS = dataset with smaller values
#dID = column index of ID record
#dDT = column index of date record
#dVal = column index of dollar value record
#returns a list of dataframes [paired matches, unmatched from dfL, unmatched from dfS]
dfLarge = dfLarge.set_index(dfLarge.columns[dID]).sort_values([dfLarge.columns[dDT], dfLarge.columns[dVal]]).reset_index()
dfSmall = dfSmall.set_index(dfSmall.columns[dID]).sort_values([dfSmall.columns[dDT], dfSmall.columns[dVal]]).reset_index()
end_row = len(dfLarge.columns[dID]) - 1
matches_master = pd.DataFrame(data = None, columns = dfLarge.columns.append(dfSmall.columns))
for lg in range(0,end_row):
sm_match_id = one_to_many(dfLarge, dfSmall)
lg_match_id = dfLarge[dfLarge.columns[dID]][lg]
if sm_match_id != [-1]:
end_of_matches = len(sm_match_id)
for sm in range(0, end_of_matches):
if sm == 0:
sm_match = dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy()
dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()
else:
sm_match = sm_match.append(dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy())
dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()
lg_match = dfLarge.loc[dfLarge[dfLarge.columns[dID]] == lg_match_id].copy()
sm_match['Match'] = lg
lg_match['Match'] = lg
sm_match.set_index('Match', inplace=True)
lg_match.set_index('Match', inplace=True)
matches = lg_match.join(sm_match, how='left')
matches_master = matches_master.append(matches)
dfLarge = dfLarge.loc[dfLarge[dfLarge.columns[dID]] != lg_match_id].copy()
return [matches_master, dfLarge, dfSmall]
IIUUC, the match is just to find the transaction in the Large DataFrame
which is on or the closest future transaction to a transaction in the small one. You can use pandas.merge_asof()
to perform a match based on the closest date in the future.
import pandas as pd
# Ensure your dates are datetime
df_large['AIssue Date'] = pd.to_datetime(df_large['AIssue Date'])
df_small['BIssue Date'] = pd.to_datetime(df_small['BIssue Date'])
merged = pd.merge_asof(df_small, df_large, left_on='BIssue Date',
right_on='AIssue Date', direction='forward')
merged
is now:
BID BAmount BIssue Date AID AAmount AIssue Date
0 1063 -300 2018-03-06 1508 -560 2018-03-14
1 1062 -260 2018-03-06 1508 -560 2018-03-14
2 839 -35 2018-03-22 1506 -35 2018-03-27
3 423 5000 2018-04-24 1500 5000 2018-04-25
If you expect things to never match, you can also throw in a tolerance
to restrict the matches to within a smaller window., that way a missing value in one DataFrame
doesn't throw everything off.
in my module iterate_one_to_many, I was counting my row length incorrectly. I needed to replace
end_row = len(dfLarge.columns[dID]) - 1
with
end_row = len(dfLarge.index)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.