
Python, Pandas matching and finding contents in two data frames

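I want to check whether the contents of one data frame also appear in another data frame.

The original data frame has 2 columns: ID and its corresponding Fruit. There is a second data frame of a different size (different numbers of rows and columns).

In the original data frame, if ID matches ID_1 and the fruit for that ID appears in the corresponding Content or Content_1 for ID_1, I want to create a new column that indicates it. (The desired output is at the end of this question.)

I tried merging the two data frames for further processing. So far I have: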

import pandas as pd

data = {'ID': ["4589", "14805", "23591", "47089", "56251", "85964", "235225", "322624", "342225", "380689", "480562", "5623", "85624", "866278"], 
'Fruit' : ["Avocado", "Blackberry", "Black Sapote", "Fingered Citron", "Crab Apples", "Custard Apple", "Chico Fruit", "Coconut", "Damson", "Elderberry", "Goji Berry", "Grape", "Guava", "Huckleberry"]
}

data_1 = {'ID_1': ["488", "14805", "23591", "470995", "56251", "85964", "5268", "322624", "342225", "380689", "480562", "5623"], 
'Content' : ["Kalo Beruin", "this is Blackberry", "Khara Beruin", "Khato Dosh", "Lapha", "Loha Sura", "Matichak", "Miniket Rice", "Mou Beruin", "Moulata", "oh Goji Berry", "purple Grape"],
'Content_1' : ["Jook-sing noodles", "Kaomianjin", "Lai fun", "Lamian", "Liangpi", "who wants Custard Apple", "Misua", "nana Coconut", "Damson", "Paomo", "Ramen", "Rice vermicelli"]
}

df = pd.DataFrame(data)
df = df[['ID', 'Fruit']]

df_1 = pd.DataFrame(data_1)
df_1 = df_1[['ID_1', 'Content', 'Content_1']]

result = df.merge(df_1, left_on = 'ID', right_on = 'ID_1', how = 'outer')

for index, row in result.iterrows():
    if row["ID"] == row["ID_1"] and row["Fruit"] in row["Content"] or row["Fruit"] in row["Content_1"]:
        print row["ID"] + row["Fruit"]

It gives me TypeError: argument of type 'float' is not iterable

(The Pandas version I am using is v0.20.3.)

How can I achieve this? Thank you.

[Image: desired output]

In some rows, row["Content"] or row["Content_1"] is NaN. NaN is a float, and a float is not iterable, so the `in` test raises the error.
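A minimal sketch of the failure and one possible guard, assuming the missing cells arrive as float NaN after the outer merge. The helper name `contains_text` is hypothetical and not part of pandas.

```python
nan = float("nan")  # stand-in for a missing cell produced by the outer merge

# The substring test fails as soon as the right-hand side is a float.
message = None
try:
    "Avocado" in nan
except TypeError as exc:
    message = str(exc)

def contains_text(needle, haystack):
    """Hypothetical helper: substring test that treats non-string cells as no match."""
    return isinstance(haystack, str) and needle in haystack

checks = [contains_text("Grape", "purple Grape"), contains_text("Avocado", nan)]
```

With a guard like this, rows that have no partner in the other frame simply count as unmatched instead of stopping the loop.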

You can catch these with try/except:

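for index, row in result.iterrows():
    try:
        # Parentheses make the ID check apply to both content columns;
        # without them, `and` binds tighter than `or`.
        if row["ID"] == row["ID_1"] and (row["Fruit"] in row["Content"] or row["Fruit"] in row["Content_1"]):
            print(str(row["ID"]) + row["Fruit"])
    except TypeError as e:
        # Raised when Fruit, Content or Content_1 is NaN (a float)
        print(e, "for:")
        print(row)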
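As an alternative to catching the exception, the missing text can be replaced with empty strings before the loop. This is a sketch on a cut-down stand-in for the merged frame, not the full data from the question; `filled` and `matched_ids` are names chosen for illustration.

```python
import pandas as pd

# Cut-down stand-in for the merged frame; None marks a missing value.
result = pd.DataFrame({
    "ID": ["4589", "14805", "85964"],
    "ID_1": [None, "14805", "85964"],
    "Fruit": ["Avocado", "Blackberry", "Custard Apple"],
    "Content": [None, "this is Blackberry", "Loha Sura"],
    "Content_1": [None, "Kaomianjin", "who wants Custard Apple"],
})

# Empty strings keep the `in` test valid for every row.
filled = result.fillna({"Content": "", "Content_1": ""})

matched_ids = [
    row["ID"]
    for _, row in filled.iterrows()
    if row["ID"] == row["ID_1"]
    and (row["Fruit"] in row["Content"] or row["Fruit"] in row["Content_1"])
]
```

The ID comparison comes first, so rows without a partner are rejected before any substring test runs.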

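I think your merge works fine. To get the output you specified, just add a Matched column that checks for NaN values:

import numpy as np

result = df.merge(df_1, left_on = 'ID', right_on = 'ID_1', how = 'outer')
# "N" if any column in the row is NaN, otherwise "Y"
result["Matched"] = np.where(result.isnull().any(axis=1), "N", "Y")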

result

        ID            Fruit    ID_1             Content  \
0     4589          Avocado     NaN                 NaN   
1    14805       Blackberry   14805  this is Blackberry   
2    23591     Black Sapote   23591        Khara Beruin   
3    47089  Fingered Citron     NaN                 NaN   
4    56251      Crab Apples   56251               Lapha   
5    85964    Custard Apple   85964           Loha Sura   

                  Content_1 Matched  
0                       NaN       N  
1                Kaomianjin       Y  
2                   Lai fun       Y  
3                       NaN       N  
4                   Liangpi       Y  
5   who wants Custard Apple       Y  
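Note that the NaN check above flags rows whose ID exists in both frames; it does not look at whether the fruit appears in the content (row 2, Black Sapote, is marked Y). If ID presence is really all that is needed, the `indicator` parameter of `merge` expresses the same idea more directly. A sketch on a cut-down sample, with the frames shortened for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": ["4589", "14805"], "Fruit": ["Avocado", "Blackberry"]})
df_1 = pd.DataFrame({"ID_1": ["488", "14805"],
                     "Content": ["Kalo Beruin", "this is Blackberry"]})

# indicator=True adds a _merge column: 'left_only', 'right_only' or 'both'
result = df.merge(df_1, left_on="ID", right_on="ID_1", how="outer", indicator=True)
result["Matched"] = np.where(result["_merge"] == "both", "Y", "N")
```

This avoids marking a row as unmatched just because some unrelated column happens to contain NaN.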

I think you need:

import numpy as np

# swap the DataFrames and use a left join
result = df_1.merge(df, left_on = 'ID_1', right_on = 'ID', how = 'left')

# drop NaNs and build a pattern with word boundaries for the substring check;
# the non-capturing group makes \b apply to every fruit, not just the first and last
pat = r'\b(?:{})\b'.format('|'.join(result["Fruit"].dropna()))

# boolean mask - the iterrows loop rewritten in a vectorized way;
# the inner parentheses make the ID check apply to both content columns
mask = ((result["ID"] == result["ID_1"]) & 
        (result["Content"].str.contains(pat, na=False) |
         result["Content_1"].str.contains(pat, na=False)))

# remove unnecessary columns
result = result.drop(['ID','Fruit'], axis=1)
# add indicator column
result['matched'] = np.where(mask, 'Y', '')

print (result)
      ID_1             Content                Content_1 matched
0      488         Kalo Beruin        Jook-sing noodles        
1    14805  this is Blackberry               Kaomianjin       Y
2    23591        Khara Beruin                  Lai fun        
3   470995          Khato Dosh                   Lamian        
4    56251               Lapha                  Liangpi        
5    85964           Loha Sura  who wants Custard Apple       Y
6     5268            Matichak                    Misua        
7   322624        Miniket Rice             nana Coconut       Y
8   342225          Mou Beruin                   Damson       Y
9   380689             Moulata                    Paomo        
10  480562       oh Goji Berry                    Ramen       Y
11    5623        purple Grape          Rice vermicelli       Y
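The pattern above matches any fruit from the list, not necessarily the fruit that belongs to that row's ID. If the row's own fruit must be the one found, a row-wise comparison is one option. This is a sketch under that assumption, on a cut-down sample; `own_fruit_found` is a hypothetical name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": ["14805", "23591", "85964"],
                   "Fruit": ["Blackberry", "Black Sapote", "Custard Apple"]})
df_1 = pd.DataFrame({
    "ID_1": ["488", "14805", "23591", "85964"],
    "Content": ["Kalo Beruin", "this is Blackberry", "Khara Beruin", "Loha Sura"],
    "Content_1": ["Jook-sing noodles", "Kaomianjin", "Lai fun", "who wants Custard Apple"],
})

result = df_1.merge(df, left_on="ID_1", right_on="ID", how="left")

# A missing Fruit (no matching ID) is not a str, so such rows are never matched.
own_fruit_found = [
    isinstance(fruit, str) and (fruit in content or fruit in content_1)
    for fruit, content, content_1 in zip(
        result["Fruit"], result["Content"], result["Content_1"])
]
result["matched"] = np.where(own_fruit_found, "Y", "")
```

Because Fruit is only filled in where the IDs matched, the `isinstance` test doubles as the ID check. This uses plain substring matching rather than word boundaries, so "Grape" would also match "Grapefruit".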

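Old solution with an outer join:

import numpy as np

result = df.merge(df_1, left_on = 'ID', right_on = 'ID_1', how = 'outer')

# the non-capturing group makes \b apply to every fruit in the alternation
pat = r'\b(?:{})\b'.format('|'.join(result["Fruit"].dropna()))

# the inner parentheses make the ID check apply to both content columns
mask = ((result["ID"] == result["ID_1"]) & 
        (result["Content"].str.contains(pat, na=False) |
         result["Content_1"].str.contains(pat, na=False)))

result['matched'] = np.where(mask, 'Y', '')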

print (result)

        ID            Fruit    ID_1             Content  \
0     4589          Avocado     NaN                 NaN   
1    14805       Blackberry   14805  this is Blackberry   
2    23591     Black Sapote   23591        Khara Beruin   
3    47089  Fingered Citron     NaN                 NaN   
4    56251      Crab Apples   56251               Lapha   
5    85964    Custard Apple   85964           Loha Sura   
6   235225      Chico Fruit     NaN                 NaN   
7   322624          Coconut  322624        Miniket Rice   
8   342225           Damson  342225          Mou Beruin   
9   380689       Elderberry  380689             Moulata   
10  480562       Goji Berry  480562       oh Goji Berry   
11    5623            Grape    5623        purple Grape   
12   85624            Guava     NaN                 NaN   
13  866278      Huckleberry     NaN                 NaN   
14     NaN              NaN     488         Kalo Beruin   
15     NaN              NaN  470995          Khato Dosh   
16     NaN              NaN    5268            Matichak   

                  Content_1 matched  
0                       NaN          
1                Kaomianjin       Y  
2                   Lai fun          
3                       NaN          
4                   Liangpi          
5   who wants Custard Apple       Y  
6                       NaN          
7              nana Coconut       Y  
8                    Damson       Y  
9                     Paomo          
10                    Ramen       Y  
11          Rice vermicelli       Y  
12                      NaN          
13                      NaN          
14        Jook-sing noodles          
15                   Lamian          
16                    Misua         

