
Matching subset of two columns of two different dataframes

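I want to compare a specific column of two different dataframes and count how many rows have a matching subset and how many do not.

Condition: if any element of small['genes of cluster'] matches an element of big['genes of cluster'], that row counts as a match, so the output should be: match: 1

In the example below, only OR4F16 appears in both dataframes, so the output should be: match: 1; unmatch: 3.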

    file1: big <tab separated>
    cl    nP    genes of cluster
     1    11    DDX11L1, MIR6859-3, WASH7P, MIR1302-2, FAM138C, FAM138F, FAM138A, OR4F5, LOC729737, LOC102725121, FAM138D
     2     4    OR4F16, OR4F3, OR4F29, LOC100132287
     3    64    LOC100133331, LOC100288069, FAM87B, LINC00115, LINC01128, FAM41C, LINC02593, SAMD11
     4     7    GNB1, CALML6, TMEM52, CFAP74, GABRD, LOC105378591, PRKCZ


    file2: small <tab separated>
    cl    nP    genes of cluster
     1    11    A, B, C, D
     2     4    OR4F16, X, Y, Z

My code (Python 3):

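import pandas as pd

def genes_coordinates(big, small):
    # Both files are tab separated with a header row
    b = pd.read_csv(big, header=0, sep="\t")
    s = pd.read_csv(small, header=0, sep="\t")

    # Collect every gene name that appears anywhere in the small file
    small_genes = set()
    for genes in s['genes of cluster']:
        small_genes.update(g.strip() for g in genes.split(','))

    match = 0
    unmatch = 0

    # A row of big is a match if any of its genes occurs in the small file
    for index, row in b.iterrows():
        row_genes = {g.strip() for g in row['genes of cluster'].split(',')}
        if row_genes & small_genes:
            match += 1
        else:
            unmatch += 1
    print("match: ", match, "\nunmatch: ", unmatch)

genes_coordinates('big','small')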

I would use pandas.merge() followed by a list comprehension to do the count.

import pandas as pd

df1 = pd.DataFrame({'cl':[1,2], 'nP':[11,4], 'gene of cluster':[['A', 'B', 'C', 'D'], ['OR4F16', 'X', 'Y', 'Z']]})
df2 = pd.DataFrame({'cl':[1,2,3,4], 'nP':[11,4,64,7], 'gene of cluster':[['DDX11L1', 'MIR6859-3', 'WASH7P', 'MIR1302-2', 'FAM138C', 'FAM138F', 'FAM138A', 'OR4F5', 'LOC729737', 'LOC102725121', 'FAM138D'], ['OR4F16', 'OR4F3', 'OR4F29', 'LOC100132287'], ['LOC100133331', 'LOC100288069', 'FAM87B', 'LINC00115', 'LINC01128', 'FAM41C', 'LINC02593', 'SAMD11'], ['GNB1', 'CALML6', 'TMEM52', 'CFAP74', 'GABRD', 'LOC105378591', 'PRKCZ']]})

df_m = df1.merge(df2, on=['cl', 'nP'], how='outer')
>>>df_m

   cl  nP  gene of cluster_x                                  gene of cluster_y
0   1  11       [A, B, C, D]  [DDX11L1, MIR6859-3, WASH7P, MIR1302-2, FAM138...
1   2   4  [OR4F16, X, Y, Z]              [OR4F16, OR4F3, OR4F29, LOC100132287]
2   3  64                NaN  [LOC100133331, LOC100288069, FAM87B, LINC00115...
3   4   7                NaN  [GNB1, CALML6, TMEM52, CFAP74, GABRD, LOC10537...

# An np.nan value is an outright 'unmatch'
found = []
for x in df_m.index:
    if isinstance(df_m.iloc[x]['gene of cluster_x'], float):
        found.append(0)
    else:
        if isinstance(df_m.iloc[x]['gene of cluster_y'], float):
            found.append(0)
        elif any([y in df_m.iloc[x]['gene of cluster_y'] for y in df_m.iloc[x]['gene of cluster_x']]):
            found.append(1)
        else:
            found.append(0)
# The counts
match = sum(found)
unmatch = len(found) - match
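
The DataFrames above hold the genes as Python lists, while the files in the question store them as one comma-separated string per row. A minimal sketch of the same merge-then-count idea starting from tab-separated text is shown below. It assumes the gene names are separated by commas, that clusters are paired on both `cl` and `nP` as in the merge above, and it uses in-memory strings (`BIG`, `SMALL`) as stand-ins for the real files. The helper names `load` and `count_matches` are made up for this illustration.

```python
import io
import pandas as pd

# Example data from the question as tab-separated text (stand-ins for the real files)
BIG = (
    "cl\tnP\tgenes of cluster\n"
    "1\t11\tDDX11L1, MIR6859-3, WASH7P, MIR1302-2, FAM138C, FAM138F, FAM138A, OR4F5, LOC729737, LOC102725121, FAM138D\n"
    "2\t4\tOR4F16, OR4F3, OR4F29, LOC100132287\n"
    "3\t64\tLOC100133331, LOC100288069, FAM87B, LINC00115, LINC01128, FAM41C, LINC02593, SAMD11\n"
    "4\t7\tGNB1, CALML6, TMEM52, CFAP74, GABRD, LOC105378591, PRKCZ\n"
)
SMALL = (
    "cl\tnP\tgenes of cluster\n"
    "1\t11\tA, B, C, D\n"
    "2\t4\tOR4F16, X, Y, Z\n"
)

def load(text):
    """Read a tab-separated table and turn the gene column into sets of names."""
    df = pd.read_csv(io.StringIO(text), sep="\t")
    df["genes of cluster"] = df["genes of cluster"].apply(
        lambda s: {g.strip() for g in s.split(",")}
    )
    return df

def count_matches(big_text, small_text):
    big, small = load(big_text), load(small_text)
    # A left merge keeps every row of big; rows with no partner in small get NaN
    merged = big.merge(small, on=["cl", "nP"], how="left",
                       suffixes=("_big", "_small"))
    # True when the small side exists and shares at least one gene with the big side
    found = [
        isinstance(s, set) and bool(b & s)
        for b, s in zip(merged["genes of cluster_big"],
                        merged["genes of cluster_small"])
    ]
    match = sum(found)
    return match, len(found) - match

match, unmatch = count_matches(BIG, SMALL)
print("match:", match, "unmatch:", unmatch)
```

With real files, the `io.StringIO(text)` argument would be replaced by the file path. Sets are used instead of lists so that the overlap test is a single intersection rather than a nested membership loop.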
