Finding one-to-many matches in two Pandas DataFrames
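I'm trying to put together a generic matching process for financial data. The goal is to take one set of data with larger transactions and match it to a set of data with smaller transactions. Some matches are one-to-many, others are one-to-one. In some cases the relationship may be reversed, so part of the approach is to feed the misses back in the opposite order to capture those possible matches.
I've created three different modules that iterate across each other to do the work, but I'm not getting consistent results. I see possible matches in my data that are not being found.
There are also no clear matching criteria, so the assumption is that if I put the datasets in date order and look for matching values, I want to take the first match, since it should be closest to the same time frame.
I'm using Pandas and Itertools, but perhaps not in the ideal way. Any help getting consistent matches would be appreciated.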
Data examples:
Large Transaction Size:
AID AIssue Date AAmount
1508 3/14/2018 -560
1506 3/27/2018 -35
1500 4/25/2018 5000
Small Transaction Size:
BID BIssue Date BAmount
1063 3/6/2018 -300
1062 3/6/2018 -260
839 3/22/2018 -35
423 4/24/2018 5000
Expected Results
AID AIssue Date AAmount BID BIssue Date BAmount
1508 3/14/2018 -560 1063 3/6/2018 -300
1508 3/14/2018 -560 1062 3/6/2018 -260
1506 3/27/2018 -35 839 3/22/2018 -35
1500 4/25/2018 5000 423 4/24/2018 5000
but I usually get
AID AIssue Date AAmount BID BIssue Date BAmount
1508 3/14/2018 -560 1063 3/6/2018 -300
1508 3/14/2018 -560 1062 3/6/2018 -260
1506 3/27/2018 -35 839 3/22/2018 -35
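There is no match for the 5000. This is one example, but when looking at larger datasets, positive versus negative doesn't seem to be a factor.
When reviewing the unmatched results for each, I see at least one $5000 transaction that I would expect to be a 1-1 match, and it is not in the results.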
import itertools
import pandas as pd

def matches(iterable):
    s = list(iterable)
    # Only going to 5 matches to avoid memory overrun on large datasets
    # (note: range(5) yields combinations of 0 to 4 items)
    s = list(itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(5)))
    return [list(elem) for elem in s]

def one_to_many(dfL, dfS, dID = 0, dDT = 1, dVal = 2):
    # dfL = dataset with larger values
    # dfS = dataset with smaller values
    # dID = column index of ID record
    # dDT = column index of date record
    # dVal = column index of dollar value record
    S = dfS[dfS.columns[dID]].values.tolist()
    S_amount = dfS[dfS.columns[dVal]].values.tolist()
    S = matches(S)
    S_amount = matches(S_amount)
    # get ID of first large record, the ID to be matched in this module
    L = dfL[dfL.columns[dID]].iloc[0]
    # get Value of first large record, this value will be matching criteria
    L_amount = dfL[dfL.columns[dVal]].iloc[0]
    count_of_sets = len(S)
    for a in range(0,count_of_sets):
        list_of_items = S[a]
        list_of_values = S_amount[a]
        if round(sum(list_of_values),2) == round(L_amount,2):
            break
    if round(sum(list_of_values),2) == round(L_amount,2):
        retVal = list_of_items
    else:
        retVal = [-1]
    return retVal

def iterate_one_to_many(dfLarge, dfSmall, dID = 0, dDT = 1, dVal = 2):
    # dfLarge = dataset with larger values
    # dfSmall = dataset with smaller values
    # dID = column index of ID record
    # dDT = column index of date record
    # dVal = column index of dollar value record
    # returns a list of dataframes [paired matches, unmatched from dfLarge, unmatched from dfSmall]
    dfLarge = dfLarge.set_index(dfLarge.columns[dID]).sort_values([dfLarge.columns[dDT], dfLarge.columns[dVal]]).reset_index()
    dfSmall = dfSmall.set_index(dfSmall.columns[dID]).sort_values([dfSmall.columns[dDT], dfSmall.columns[dVal]]).reset_index()
    end_row = len(dfLarge.columns[dID]) - 1
    matches_master = pd.DataFrame(data = None, columns = dfLarge.columns.append(dfSmall.columns))
    for lg in range(0,end_row):
        sm_match_id = one_to_many(dfLarge, dfSmall)
        lg_match_id = dfLarge[dfLarge.columns[dID]][lg]
        if sm_match_id != [-1]:
            end_of_matches = len(sm_match_id)
            for sm in range(0, end_of_matches):
                if sm == 0:
                    sm_match = dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy()
                    dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()
                else:
                    sm_match = sm_match.append(dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy())
                    dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()
            lg_match = dfLarge.loc[dfLarge[dfLarge.columns[dID]] == lg_match_id].copy()
            sm_match['Match'] = lg
            lg_match['Match'] = lg
            sm_match.set_index('Match', inplace=True)
            lg_match.set_index('Match', inplace=True)
            matches = lg_match.join(sm_match, how='left')
            matches_master = matches_master.append(matches)
            dfLarge = dfLarge.loc[dfLarge[dfLarge.columns[dID]] != lg_match_id].copy()
    return [matches_master, dfLarge, dfSmall]
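IIUC, the match just finds the transaction in the large DataFrame that falls on the same date as, or is the closest future transaction to, each transaction in the small one. You can use pandas.merge_asof() to match on the closest date in the future.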
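import pandas as pd

# Ensure your dates are datetime
df_large['AIssue Date'] = pd.to_datetime(df_large['AIssue Date'])
df_small['BIssue Date'] = pd.to_datetime(df_small['BIssue Date'])

# merge_asof requires both frames to be sorted on their key columns
df_large = df_large.sort_values('AIssue Date', kind='mergesort')
df_small = df_small.sort_values('BIssue Date', kind='mergesort')

merged = pd.merge_asof(df_small, df_large, left_on='BIssue Date',
                       right_on='AIssue Date', direction='forward')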
Now merged is:
BID BAmount BIssue Date AID AAmount AIssue Date
0 1063 -300 2018-03-06 1508 -560 2018-03-14
1 1062 -260 2018-03-06 1508 -560 2018-03-14
2 839 -35 2018-03-22 1506 -35 2018-03-27
3 423 5000 2018-04-24 1500 5000 2018-04-25
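Note that merge_asof pairs rows on dates only, so it may be worth confirming afterwards that the amounts agree. The following is a minimal sketch, not part of the original answer: it rebuilds the merged frame shown above by hand (dates omitted), sums BAmount per AID, and compares the total with AAmount. The names check, BTotal, and Balanced are made up for illustration.

```python
import pandas as pd

# Hand-built copy of the merged result shown above (date columns omitted).
merged = pd.DataFrame({
    "BID": [1063, 1062, 839, 423],
    "BAmount": [-300, -260, -35, 5000],
    "AID": [1508, 1508, 1506, 1500],
    "AAmount": [-560, -560, -35, 5000],
})

# For each large transaction, total the small transactions paired with it.
check = merged.groupby("AID", sort=False).agg(
    AAmount=("AAmount", "first"),
    BTotal=("BAmount", "sum"),
)

# Hypothetical flag column: True where the small amounts add up to the large one.
check["Balanced"] = check["AAmount"].round(2) == check["BTotal"].round(2)
print(check)
```

Any AID with Balanced set to False was paired by date but does not reconcile by amount, and would need another look.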
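If you expect that some things may never match, you can also throw in a tolerance to restrict matches to a smaller window. That way a missing value in one DataFrame doesn't throw everything off.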
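As a rough illustration of tolerance, here is a sketch using the sample data from the question. The 7-day window is an arbitrary assumption chosen only to show the effect, and unmatched is a name introduced here. Rows with no large transaction inside the window get NaN in the AID columns instead of being paired with a distant one.

```python
import pandas as pd

# Sample data from the question, already sorted by date.
df_large = pd.DataFrame({
    "AID": [1508, 1506, 1500],
    "AIssue Date": pd.to_datetime(["3/14/2018", "3/27/2018", "4/25/2018"]),
    "AAmount": [-560, -35, 5000],
})
df_small = pd.DataFrame({
    "BID": [1063, 1062, 839, 423],
    "BIssue Date": pd.to_datetime(["3/6/2018", "3/6/2018", "3/22/2018", "4/24/2018"]),
    "BAmount": [-300, -260, -35, 5000],
})

# Only accept a forward match that is at most 7 days away (assumed window).
merged = pd.merge_asof(
    df_small, df_large,
    left_on="BIssue Date", right_on="AIssue Date",
    direction="forward",
    tolerance=pd.Timedelta(days=7),
)

# Small transactions that found no large transaction within the window.
unmatched = merged[merged["AID"].isna()]
print(merged)
print(unmatched["BID"].tolist())
```

With this window, the two 3/6 transactions are left unmatched because 3/14 is eight days later, while the other two rows still pair up. Widening the tolerance to 8 days or more restores the original result.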
In my module iterate_one_to_many, I was calculating the row count incorrectly. I needed to replace
end_row = len(dfLarge.columns[dID]) - 1
with
end_row = len(dfLarge.index)
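To see why this matters, here is a small sketch using the sample large-transaction data. The variable names buggy_end_row and fixed_end_row are introduced only for this illustration. dfLarge.columns[dID] is the column name (the string "AID"), so its length has nothing to do with the number of rows.

```python
import pandas as pd

dfLarge = pd.DataFrame({
    "AID": [1508, 1506, 1500],
    "AIssue Date": ["3/14/2018", "3/27/2018", "4/25/2018"],
    "AAmount": [-560, -35, 5000],
})
dID = 0

# dfLarge.columns[dID] is the column name "AID", so this is len("AID") - 1.
buggy_end_row = len(dfLarge.columns[dID]) - 1

# len(dfLarge.index) is the actual number of rows.
fixed_end_row = len(dfLarge.index)

print(list(range(0, buggy_end_row)))  # rows the original loop visits
print(list(range(0, fixed_end_row)))  # rows the corrected loop visits
```

With the sample data the original loop runs only twice, which is why the third row (the 5000 transaction) never got a chance to match.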