Finding one to many matches in two Pandas Dataframes

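I am trying to put together a generic matching process for financial data. The goal is to take one set of data with larger transactions and match it against a set of data with smaller transactions. Some matches are one to many, others are one to one. In some cases the direction may be reversed, so part of the approach is to feed the missed matches back in the opposite order to catch those possible matches.

I have created three different modules that iterate over each other to do the work, but I am not getting consistent results. I can see possible matches in my data that are not being found.

There is also no explicit matching criterion, so the assumption is that if I put the datasets in date order and look for matching values, I want to take the first match, since it should be closest to the same timeframe.

I am using Pandas and Itertools, but possibly not in the ideal way. Any help getting consistent matches would be appreciated.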

Data examples:

Large Transaction Size:

AID    AIssue Date    AAmount
1508     3/14/2018   -560
1506     3/27/2018    -35
1500     4/25/2018   5000

Small Transaction Size:
BID     BIssue Date   BAmount
1063     3/6/2018     -300
1062     3/6/2018     -260
839      3/22/2018     -35
423      4/24/2018    5000

Expected Results
AID     AIssue Date   AAmount    BID     BIssue Date   BAmount
1508     3/14/2018     -560      1063      3/6/2018     -300
1508     3/14/2018     -560      1062      3/6/2018     -260
1506     3/27/2018      -35       839      3/22/2018     -35
1500     4/25/2018     5000       423      4/24/2018    5000

but I usually get
AID     AIssue Date   AAmount    BID     BIssue Date   BAmount
1508     3/14/2018     -560      1063      3/6/2018     -300
1508     3/14/2018     -560      1062      3/6/2018     -260
1506     3/27/2018      -35       839      3/22/2018     -35

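with no match for the 5000. This is just an example, but when looking at larger datasets, positive versus negative amounts do not seem to be a factor.

When I review the unmatched results from each side, I see at least one $5000 transaction that I would expect to be a 1-1 match, and it is not in the results.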

import itertools
import pandas as pd

def matches(iterable):
    s = list(iterable)
    #range(5) yields combinations of 0 to 4 items; capped to avoid memory overrun on large datasets
    s = list(itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(5)))

    return [list(elem) for elem in s]

def one_to_many(dfL, dfS, dID = 0, dDT = 1, dVal = 2):   
    #dfL = dataset with larger values
    #dfS = dataset with smaller values
    #dID = column index of ID record
    #dDT = column index of date record
    #dVal = column index of dollar value record

    S = dfS[dfS.columns[dID]].values.tolist()
    S_amount = dfS[dfS.columns[dVal]].values.tolist()

    S = matches(S)
    S_amount = matches(S_amount)

    #get ID of first large record, the ID to be matched in this module
    L = dfL[dfL.columns[dID]].iloc[0]

    #get Value of first large record, this value will be matching criteria
    L_amount = dfL[dfL.columns[dVal]].iloc[0]

    count_of_sets = len(S)

    for a in range(0,count_of_sets):

        list_of_items = S[a]
        list_of_values = S_amount[a]

        if round(sum(list_of_values),2) == round(L_amount,2):
            break

    if round(sum(list_of_values),2) == round(L_amount,2):
        retVal = list_of_items
    else:
        retVal = [-1]

    return retVal

def iterate_one_to_many(dfLarge, dfSmall, dID = 0, dDT = 1, dVal = 2):
    #dfL = dataset with larger values
    #dfS = dataset with smaller values
    #dID = column index of ID record
    #dDT = column index of date record
    #dVal = column index of dollar value record

    #returns a list of dataframes [paired matches, unmatched from dfL, unmatched from dfS]

    dfLarge = dfLarge.set_index(dfLarge.columns[dID]).sort_values([dfLarge.columns[dDT], dfLarge.columns[dVal]]).reset_index()
    dfSmall = dfSmall.set_index(dfSmall.columns[dID]).sort_values([dfSmall.columns[dDT], dfSmall.columns[dVal]]).reset_index()

    end_row = len(dfLarge.columns[dID]) - 1

    matches_master = pd.DataFrame(data = None, columns = dfLarge.columns.append(dfSmall.columns))

    for lg in range(0,end_row):

        sm_match_id = one_to_many(dfLarge, dfSmall)
        lg_match_id = dfLarge[dfLarge.columns[dID]][lg]

        if sm_match_id != [-1]:

            end_of_matches = len(sm_match_id)

            for sm in range(0, end_of_matches):
                if sm == 0:
                    sm_match = dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy()
                    dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()
                else:
                    sm_match = sm_match.append(dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy())
                    dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()

            lg_match = dfLarge.loc[dfLarge[dfLarge.columns[dID]] == lg_match_id].copy()

            sm_match['Match'] = lg
            lg_match['Match'] = lg

            sm_match.set_index('Match', inplace=True)
            lg_match.set_index('Match', inplace=True)

            matches = lg_match.join(sm_match, how='left')
            matches_master = matches_master.append(matches)

            dfLarge = dfLarge.loc[dfLarge[dfLarge.columns[dID]] != lg_match_id].copy()

    return [matches_master, dfLarge, dfSmall]

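IIUC, the matching just finds the transaction in the large DataFrame that falls on the same date as, or is the closest future transaction to, each transaction in the small one. You can use pandas.merge_asof() to match on the closest date in the future.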

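import pandas as pd
# Ensure your dates are datetime
df_large['AIssue Date'] = pd.to_datetime(df_large['AIssue Date'])
df_small['BIssue Date'] = pd.to_datetime(df_small['BIssue Date'])

# merge_asof requires both frames to be sorted by their key column
df_large = df_large.sort_values('AIssue Date')
df_small = df_small.sort_values('BIssue Date')

merged = pd.merge_asof(df_small, df_large, left_on='BIssue Date',
                       right_on='AIssue Date', direction='forward')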

merged is now:

    BID  BAmount BIssue Date   AID  AAmount AIssue Date
0  1063     -300  2018-03-06  1508     -560  2018-03-14
1  1062     -260  2018-03-06  1508     -560  2018-03-14
2   839      -35  2018-03-22  1506      -35  2018-03-27
3   423     5000  2018-04-24  1500     5000  2018-04-25
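As a rough way to try this on the sample data from the question, the sketch below rebuilds both frames and runs the same merge. Note that merge_asof pairs rows by date only, so the sketch also adds a per-AID check that the matched small amounts sum to the large amount. The `check` frame and its `BTotal` and `balanced` columns are names made up here for illustration; they are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df_large = pd.DataFrame({
    "AID": [1508, 1506, 1500],
    "AIssue Date": pd.to_datetime(
        ["3/14/2018", "3/27/2018", "4/25/2018"], format="%m/%d/%Y"),
    "AAmount": [-560, -35, 5000],
})
df_small = pd.DataFrame({
    "BID": [1063, 1062, 839, 423],
    "BIssue Date": pd.to_datetime(
        ["3/6/2018", "3/6/2018", "3/22/2018", "4/24/2018"], format="%m/%d/%Y"),
    "BAmount": [-300, -260, -35, 5000],
})

# Match each small transaction to the next large transaction by date
merged = pd.merge_asof(
    df_small.sort_values("BIssue Date"),
    df_large.sort_values("AIssue Date"),
    left_on="BIssue Date",
    right_on="AIssue Date",
    direction="forward",
)

# Hypothetical sanity check: do the matched small amounts add up
# to the large amount they were paired with?
check = merged.groupby("AID").agg(
    AAmount=("AAmount", "first"),
    BTotal=("BAmount", "sum"),
)
check["balanced"] = check["AAmount"] == check["BTotal"]
print(check)
```

Any AID where `balanced` is False would be a date-based pairing whose amounts do not reconcile and probably needs a closer look.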

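If you expect that some rows should never match, you can also pass a tolerance to restrict matches to a smaller window. That way, a missing value in one DataFrame doesn't throw everything else off.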
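A minimal sketch of the tolerance idea, assuming the large transaction 1506 is missing from the data and that a 10-day window is acceptable (both are assumptions made for this illustration, not values from the question):

```python
import pandas as pd

# Large side with transaction 1506 deliberately left out
df_large = pd.DataFrame({
    "AID": [1508, 1500],
    "AIssue Date": pd.to_datetime(
        ["3/14/2018", "4/25/2018"], format="%m/%d/%Y"),
    "AAmount": [-560, 5000],
})
df_small = pd.DataFrame({
    "BID": [1063, 1062, 839, 423],
    "BIssue Date": pd.to_datetime(
        ["3/6/2018", "3/6/2018", "3/22/2018", "4/24/2018"], format="%m/%d/%Y"),
    "BAmount": [-300, -260, -35, 5000],
})

# Without a tolerance, BID 839 is paired with the next large row,
# however far in the future it is
loose = pd.merge_asof(
    df_small, df_large,
    left_on="BIssue Date", right_on="AIssue Date",
    direction="forward",
)

# With a 10-day tolerance (an assumed window), BID 839 stays unmatched
strict = pd.merge_asof(
    df_small, df_large,
    left_on="BIssue Date", right_on="AIssue Date",
    direction="forward",
    tolerance=pd.Timedelta(days=10),
)

print(loose[["BID", "AID"]])
print(strict[["BID", "AID"]])
```

In `strict`, unmatched rows carry NaN in the columns coming from the large frame, so they can be filtered out with `strict["AID"].isna()` and handled separately.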

In my module iterate_one_to_many, I was computing the number of rows incorrectly. I needed to replace

end_row = len(dfLarge.columns[dID]) - 1

with

end_row = len(dfLarge.index)
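To illustrate the difference, here is a small sketch using the sample large frame from the question. The variable names `buggy_end_row`, `fixed_end_row`, and the two `rows_visited_*` lists are introduced here for illustration only.

```python
import pandas as pd

dfLarge = pd.DataFrame({
    "AID": [1508, 1506, 1500],
    "AIssue Date": ["3/14/2018", "3/27/2018", "4/25/2018"],
    "AAmount": [-560, -35, 5000],
})
dID = 0

# dfLarge.columns[dID] is the column *name* "AID", so this is
# the length of that string (3) minus 1
buggy_end_row = len(dfLarge.columns[dID]) - 1

# len(dfLarge.index) is the number of rows in the frame
fixed_end_row = len(dfLarge.index)

rows_visited_buggy = list(range(0, buggy_end_row))
rows_visited_fixed = list(range(0, fixed_end_row))

print(rows_visited_buggy)
print(rows_visited_fixed)
```

With a three-character column name such as AID, the original expression makes the loop visit only two rows regardless of how many rows the frame has, which would be consistent with the 5000 transaction being left out of the results.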
