
Finding one-to-many matches in two Pandas DataFrames

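I'm trying to put together a generic matching process for financial data. The goal is to take one set of data with larger transactions and match it against a set of data with smaller transactions. Some matches are one-to-many, others are one-to-one. In some cases the relationship may be reversed, so part of the approach is to feed the missed matches back in the opposite order to catch those cases.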

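I've created three different modules that iterate across each other to do the work, but I'm not getting consistent results. I can see possible matches in my data that are not being found.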

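There are also no clear matching criteria, so my assumption is that if I put the datasets in date order and look for matching values, I want to take the first match, since it should be closest to the same timeframe.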

I am using Pandas and Itertools, but perhaps not in the ideal way. Any help getting consistent matches would be appreciated.

Data examples:

Large Transaction Size:

AID    AIssue Date    AAmount
1508     3/14/2018   -560
1506     3/27/2018    -35
1500     4/25/2018   5000

Small Transaction Size:
BID     BIssue Date   BAmount
1063     3/6/2018     -300
1062     3/6/2018     -260
839      3/22/2018     -35
423      4/24/2018    5000
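
For anyone who wants to reproduce this, the two sample tables above could be built roughly as follows. This is only a sketch: the variable names `df_large` and `df_small` are an assumption (chosen to match the names used in the answer's snippet), and the dates are parsed as month/day/year.

```python
import pandas as pd

# Sample data copied from the tables above; variable names are assumed.
df_large = pd.DataFrame({
    "AID": [1508, 1506, 1500],
    "AIssue Date": pd.to_datetime(
        ["3/14/2018", "3/27/2018", "4/25/2018"], format="%m/%d/%Y"),
    "AAmount": [-560, -35, 5000],
})

df_small = pd.DataFrame({
    "BID": [1063, 1062, 839, 423],
    "BIssue Date": pd.to_datetime(
        ["3/6/2018", "3/6/2018", "3/22/2018", "4/24/2018"], format="%m/%d/%Y"),
    "BAmount": [-300, -260, -35, 5000],
})
```

Both amount columns add up to the same total here, which is what you would expect if every small transaction belongs to exactly one large one.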

Expected Results
AID     AIssue Date   AAmount    BID     BIssue Date   BAmount
1508     3/14/2018     -560      1063      3/6/2018     -300
1508     3/14/2018     -560      1062      3/6/2018     -260
1506     3/27/2018      -35       839      3/22/2018     -35
1500     4/25/2018     5000       423      4/24/2018    5000

but I usually get
AID     AIssue Date   AAmount    BID     BIssue Date   BAmount
1508     3/14/2018     -560      1063      3/6/2018     -300
1508     3/14/2018     -560      1062      3/6/2018     -260
1506     3/27/2018      -35       839      3/22/2018     -35

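There is no match for the 5000. This is just one example, but when reviewing larger datasets, positive versus negative amounts do not seem to be a factor.

When I check the unmatched results from each side, I see at least one $5000 transaction that I would expect to be a 1-1 match and that is not in the results.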

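import itertools
import pandas as pd

def matches(iterable):
    s = list(iterable)
    #Only combinations of 1 to 4 items, to avoid memory overrun on large datasets
    #(starting at 1 skips the empty combination, whose sum of 0 would match a 0 amount)
    s = list(itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(1, 5)))

    return [list(elem) for elem in s]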

def one_to_many(dfL, dfS, dID = 0, dDT = 1, dVal = 2):   
    #dfL = dataset with larger values
    #dfS = dataset with smaller values
    #dID = column index of ID record
    #dDT = column index of date record
    #dVal = column index of dollar value record

    S = dfS[dfS.columns[dID]].values.tolist()
    S_amount = dfS[dfS.columns[dVal]].values.tolist()

    S = matches(S)
    S_amount = matches(S_amount)

    #get ID of first large record, the ID to be matched in this module
    L = dfL[dfL.columns[dID]].iloc[0]

    #get Value of first large record, this value will be matching criteria
    L_amount = dfL[dfL.columns[dVal]].iloc[0]

    count_of_sets = len(S)

    for a in range(0,count_of_sets):

        list_of_items = S[a]
        list_of_values = S_amount[a]

        if round(sum(list_of_values),2) == round(L_amount,2):
            break

    if round(sum(list_of_values),2) == round(L_amount,2):
        retVal = list_of_items
    else:
        retVal = [-1]

    return retVal

def iterate_one_to_many(dfLarge, dfSmall, dID = 0, dDT = 1, dVal = 2):
    #dfL = dataset with larger values
    #dfS = dataset with smaller values
    #dID = column index of ID record
    #dDT = column index of date record
    #dVal = column index of dollar value record

    #returns a list of dataframes [paired matches, unmatched from dfL, unmatched from dfS]

    dfLarge = dfLarge.set_index(dfLarge.columns[dID]).sort_values([dfLarge.columns[dDT], dfLarge.columns[dVal]]).reset_index()
    dfSmall = dfSmall.set_index(dfSmall.columns[dID]).sort_values([dfSmall.columns[dDT], dfSmall.columns[dVal]]).reset_index()

    end_row = len(dfLarge.columns[dID]) - 1

    matches_master = pd.DataFrame(data = None, columns = dfLarge.columns.append(dfSmall.columns))

    for lg in range(0,end_row):

        sm_match_id = one_to_many(dfLarge, dfSmall)
        lg_match_id = dfLarge[dfLarge.columns[dID]][lg]

        if sm_match_id != [-1]:

            end_of_matches = len(sm_match_id)

            for sm in range(0, end_of_matches):
                if sm == 0:
                    sm_match = dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy()
                    dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()
                else:
                    sm_match = sm_match.append(dfSmall.loc[dfSmall[dfSmall.columns[dID]] == sm_match_id[sm]].copy())
                    dfSmall = dfSmall.loc[dfSmall[dfSmall.columns[dID]] != sm_match_id[sm]].copy()

            lg_match = dfLarge.loc[dfLarge[dfLarge.columns[dID]] == lg_match_id].copy()

            sm_match['Match'] = lg
            lg_match['Match'] = lg

            sm_match.set_index('Match', inplace=True)
            lg_match.set_index('Match', inplace=True)

            matches = lg_match.join(sm_match, how='left')
            matches_master = matches_master.append(matches)

            dfLarge = dfLarge.loc[dfLarge[dfLarge.columns[dID]] != lg_match_id].copy()

    return [matches_master, dfLarge, dfSmall]

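IIUC, the matching is just finding, for each transaction in the small DataFrame, the transaction in the large DataFrame that falls on the same date or is the closest one in the future. You can use pandas.merge_asof() to perform a match based on the closest date in the future.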

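import pandas as pd
# Ensure your dates are datetime
df_large['AIssue Date'] = pd.to_datetime(df_large['AIssue Date'])
df_small['BIssue Date'] = pd.to_datetime(df_small['BIssue Date'])

# merge_asof requires both frames to be sorted on their key columns
df_large = df_large.sort_values('AIssue Date')
df_small = df_small.sort_values('BIssue Date')

merged = pd.merge_asof(df_small, df_large, left_on='BIssue Date',
                       right_on='AIssue Date', direction='forward')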

merged is now:

    BID  BAmount BIssue Date   AID  AAmount AIssue Date
0  1063     -300  2018-03-06  1508     -560  2018-03-14
1  1062     -260  2018-03-06  1508     -560  2018-03-14
2   839      -35  2018-03-22  1506      -35  2018-03-27
3   423     5000  2018-04-24  1500     5000  2018-04-25
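
Note that merge_asof pairs rows on dates only and never looks at the amounts. One way to sanity-check the result, sketched here under the assumption that the sample data from the question is loaded as `df_large` and `df_small`, is to sum the small amounts assigned to each large transaction and compare that with the large amount. The `check` frame and its column names are made up for illustration.

```python
import pandas as pd

# Sample data from the question (already sorted by date).
df_large = pd.DataFrame({
    "AID": [1508, 1506, 1500],
    "AIssue Date": pd.to_datetime(
        ["3/14/2018", "3/27/2018", "4/25/2018"], format="%m/%d/%Y"),
    "AAmount": [-560, -35, 5000],
})
df_small = pd.DataFrame({
    "BID": [1063, 1062, 839, 423],
    "BIssue Date": pd.to_datetime(
        ["3/6/2018", "3/6/2018", "3/22/2018", "4/24/2018"], format="%m/%d/%Y"),
    "BAmount": [-300, -260, -35, 5000],
})

merged = pd.merge_asof(df_small, df_large, left_on="BIssue Date",
                       right_on="AIssue Date", direction="forward")

# For each large transaction, total the small amounts matched to it
# and flag whether the total equals the large amount.
check = merged.groupby("AID").agg(
    b_total=("BAmount", "sum"),
    a_amount=("AAmount", "first"),
)
check["balanced"] = check["b_total"] == check["a_amount"]
```

Any row where `balanced` is False is a date-based pairing whose amounts do not add up and would need a closer look. With float amounts you would probably want to round both sides first, as the code in the question does.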

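If you expect that some things should never match, you can also pass a tolerance to restrict the matches to a smaller window. That way a missing value in one DataFrame doesn't throw everything off.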
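
As a rough illustration of the tolerance idea, the sketch below uses a 7-day window. That window is an arbitrary assumption, not something taken from the question. Rows in the small frame with no large transaction inside the window come back with NaN in the large-side columns.

```python
import pandas as pd

# Sample data from the question (already sorted by date).
df_large = pd.DataFrame({
    "AID": [1508, 1506, 1500],
    "AIssue Date": pd.to_datetime(
        ["3/14/2018", "3/27/2018", "4/25/2018"], format="%m/%d/%Y"),
    "AAmount": [-560, -35, 5000],
})
df_small = pd.DataFrame({
    "BID": [1063, 1062, 839, 423],
    "BIssue Date": pd.to_datetime(
        ["3/6/2018", "3/6/2018", "3/22/2018", "4/24/2018"], format="%m/%d/%Y"),
    "BAmount": [-300, -260, -35, 5000],
})

# Only accept a forward match that is at most 7 days away (assumed window).
merged_tol = pd.merge_asof(df_small, df_large, left_on="BIssue Date",
                           right_on="AIssue Date", direction="forward",
                           tolerance=pd.Timedelta(days=7))

# Small transactions that found no partner inside the window.
unmatched_small = merged_tol[merged_tol["AID"].isna()]
```

With this sample data the two 3/6/2018 rows end up unmatched, because the nearest large transaction (3/14/2018) is 8 days away. So the window needs to be chosen with the real data in mind.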

In my module iterate_one_to_many, I was calculating the row count incorrectly. I needed to replace

end_row = len(dfLarge.columns[dID]) - 1

with

end_row = len(dfLarge.index)
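
To see why this matters with the sample data: `dfLarge.columns[dID]` is the column name itself (the string 'AID'), so the old expression measured the length of the name rather than the number of rows. A minimal sketch, with the sample frame rebuilt here and the names `wrong_end_row` and `right_end_row` used only for illustration:

```python
import pandas as pd

dfLarge = pd.DataFrame({
    "AID": [1508, 1506, 1500],
    "AIssue Date": ["3/14/2018", "3/27/2018", "4/25/2018"],
    "AAmount": [-560, -35, 5000],
})
dID = 0

# Old expression: length of the column NAME 'AID' (3 characters) minus 1.
wrong_end_row = len(dfLarge.columns[dID]) - 1

# Corrected expression: the actual number of rows.
right_end_row = len(dfLarge.index)

print(wrong_end_row, right_end_row)  # 2 3
```

With `range(0, end_row)` the old value gives only two iterations, so the third row in date order (ID 1500, the 5000 transaction) was never tried. That is consistent with the missing match described in the question.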
