繁体   English   中英

Python,Pandas匹配并在两个数据框中查找内容

[英]Python, Pandas matching and finding contents in two data frames

检查一个数据帧中的内容是否也在另一个数据帧中。

原始数据框具有2列,ID及其对应的Fruits。 还有另一个大小不同的数据框(行和列数)

在原始数据帧中,如果ID与ID_1匹配,并且ID的对应水果在ID_1的对应Content或Content_1中,请创建一个新列来指示它。 (所需的输出在此问题的结尾)

我试图合并两个数据框以进行进一步处理。 到目前为止,我有:

import pandas as pd

data = {'ID': ["4589", "14805", "23591", "47089", "56251", "85964", "235225", "322624", "342225", "380689", "480562", "5623", "85624", "866278"], 
'Fruit' : ["Avocado", "Blackberry", "Black Sapote", "Fingered Citron", "Crab Apples", "Custard Apple", "Chico Fruit", "Coconut", "Damson", "Elderberry", "Goji Berry", "Grape", "Guava", "Huckleberry"]
}

data_1 = {'ID_1': ["488", "14805", "23591", "470995", "56251", "85964", "5268", "322624", "342225", "380689", "480562", "5623"], 
'Content' : ["Kalo Beruin", "this is Blackberry", "Khara Beruin", "Khato Dosh", "Lapha", "Loha Sura", "Matichak", "Miniket Rice", "Mou Beruin", "Moulata", "oh Goji Berry", "purple Grape"],
'Content_1' : ["Jook-sing noodles", "Kaomianjin", "Lai fun", "Lamian", "Liangpi", "who wants Custard Apple", "Misua", "nana Coconut", "Damson", "Paomo", "Ramen", "Rice vermicelli"]
}

df = pd.DataFrame(data)
df = df[['ID', 'Fruit']]

df_1 = pd.DataFrame(data_1)
df_1 = df_1[['ID_1', 'Content', 'Content_1']]

result = df.merge(df_1, left_on = 'ID', right_on = 'ID_1', how = 'outer')

for index, row in result.iterrows():
    if row["ID"] == row["ID_1"] and row["Fruit"] in row["Content"] or row["Fruit"] in row["Content_1"]:
        print row["ID"] + row["Fruit"]

它给我TypeError:类型'float'的参数是不可迭代的

(我使用的Pandas版本是v.0.20.3。)

我该如何实现? 谢谢。

在此处输入图片说明

在某些情况下, row["Content"]row["Content_1"]NaN NaN是一个float ,并且也是不可迭代的-这就是为什么会出现错误的原因。

您可以使用try / except捕获这些:

for index, row in result.iterrows():
    try:
        if row["ID"] == row["ID_1"] and row["Fruit"] in row["Content"] or row["Fruit"] in row["Content_1"]:
            print( str(row["ID"]) + row["Fruit"])
    except TypeError as e:
        print(e, "for:")
        print(row)

我认为您的合并工作正常。 要获得您指定的输出,只需添加一个Matched列以检查NaN值:

result = df.merge(df_1, left_on = 'ID', right_on = 'ID_1', how = 'outer')
result["Matched"] = np.where(result.isnull().any(axis=1), "N", "Y")

result

        ID            Fruit    ID_1             Content  \
0     4589          Avocado     NaN                 NaN   
1    14805       Blackberry   14805  this is Blackberry   
2    23591     Black Sapote   23591        Khara Beruin   
3    47089  Fingered Citron     NaN                 NaN   
4    56251      Crab Apples   56251               Lapha   
5    85964    Custard Apple   85964           Loha Sura   

                  Content_1 Matched  
0                       NaN       N  
1                Kaomianjin       Y  
2                   Lai fun       Y  
3                       NaN       N  
4                   Liangpi       Y  
5   who wants Custard Apple       Y  

我认为需要:

#swap DataFrames with left join
result = df_1.merge(df, left_on = 'ID_1', right_on = 'ID', how = 'left')

#remove NaNs and create pattern with word boundary for check substrings
pat = r'\b{}\b'.format('|'.join(result["Fruit"].dropna()))

#boolan mask - rewritten iterrows to vectorized way
mask = ((result["ID"] == result["ID_1"]) & 
         result["Content"].str.contains(pat, na=False) |
         result["Content_1"].str.contains(pat, na=False))

#remove unnecessary columns
result = result.drop(['ID','Fruit'], axis=1)
#add indicator column
result['matched'] = np.where(mask, 'Y', '')

print (result)
      ID_1             Content                Content_1 matched
0      488         Kalo Beruin        Jook-sing noodles        
1    14805  this is Blackberry               Kaomianjin       Y
2    23591        Khara Beruin                  Lai fun        
3   470995          Khato Dosh                   Lamian        
4    56251               Lapha                  Liangpi        
5    85964           Loha Sura  who wants Custard Apple       Y
6     5268            Matichak                    Misua        
7   322624        Miniket Rice             nana Coconut       Y
8   342225          Mou Beruin                   Damson       Y
9   380689             Moulata                    Paomo        
10  480562       oh Goji Berry                    Ramen       Y
11    5623        purple Grape          Rice vermicelli       Y

具有outer联接的旧解决方案:

result = df.merge(df_1, left_on = 'ID', right_on = 'ID_1', how = 'outer')

pat = r'\b{}\b'.format('|'.join(result["Fruit"].dropna()))

mask = ((result["ID"] == result["ID_1"]) & 
         result["Content"].str.contains(pat, na=False)|     
         result["Content_1"].str.contains(pat, na=False))

result['matched'] = np.where(mask, 'Y', '')

print (result)

        ID            Fruit    ID_1             Content  \
0     4589          Avocado     NaN                 NaN   
1    14805       Blackberry   14805  this is Blackberry   
2    23591     Black Sapote   23591        Khara Beruin   
3    47089  Fingered Citron     NaN                 NaN   
4    56251      Crab Apples   56251               Lapha   
5    85964    Custard Apple   85964           Loha Sura   
6   235225      Chico Fruit     NaN                 NaN   
7   322624          Coconut  322624        Miniket Rice   
8   342225           Damson  342225          Mou Beruin   
9   380689       Elderberry  380689             Moulata   
10  480562       Goji Berry  480562       oh Goji Berry   
11    5623            Grape    5623        purple Grape   
12   85624            Guava     NaN                 NaN   
13  866278      Huckleberry     NaN                 NaN   
14     NaN              NaN     488         Kalo Beruin   
15     NaN              NaN  470995          Khato Dosh   
16     NaN              NaN    5268            Matichak   

                  Content_1 matched  
0                       NaN          
1                Kaomianjin       Y  
2                   Lai fun          
3                       NaN          
4                   Liangpi          
5   who wants Custard Apple       Y  
6                       NaN          
7              nana Coconut       Y  
8                    Damson       Y  
9                     Paomo          
10                    Ramen       Y  
11          Rice vermicelli       Y  
12                      NaN          
13                      NaN          
14        Jook-sing noodles          
15                   Lamian          
16                    Misua         

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM