Python, Pandas matching and finding contents in two data frames
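I want to check whether the contents of one data frame are also present in another data frame.
The original data frame has 2 columns, ID and its corresponding Fruit. There is another data frame of a different size (number of rows and columns).
In the original data frame, if ID matches ID_1, and the fruit for that ID appears in the corresponding Content or Content_1 of ID_1, create a new column to indicate it. (The desired output is at the end of this question.)
I have tried merging the two data frames for further processing. So far I have: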
import pandas as pd
data = {'ID': ["4589", "14805", "23591", "47089", "56251", "85964", "235225", "322624", "342225", "380689", "480562", "5623", "85624", "866278"],
'Fruit' : ["Avocado", "Blackberry", "Black Sapote", "Fingered Citron", "Crab Apples", "Custard Apple", "Chico Fruit", "Coconut", "Damson", "Elderberry", "Goji Berry", "Grape", "Guava", "Huckleberry"]
}
data_1 = {'ID_1': ["488", "14805", "23591", "470995", "56251", "85964", "5268", "322624", "342225", "380689", "480562", "5623"],
'Content' : ["Kalo Beruin", "this is Blackberry", "Khara Beruin", "Khato Dosh", "Lapha", "Loha Sura", "Matichak", "Miniket Rice", "Mou Beruin", "Moulata", "oh Goji Berry", "purple Grape"],
'Content_1' : ["Jook-sing noodles", "Kaomianjin", "Lai fun", "Lamian", "Liangpi", "who wants Custard Apple", "Misua", "nana Coconut", "Damson", "Paomo", "Ramen", "Rice vermicelli"]
}
df = pd.DataFrame(data)
df = df[['ID', 'Fruit']]
df_1 = pd.DataFrame(data_1)
df_1 = df_1[['ID_1', 'Content', 'Content_1']]
result = df.merge(df_1, left_on = 'ID', right_on = 'ID_1', how = 'outer')
for index, row in result.iterrows():
    if row["ID"] == row["ID_1"] and row["Fruit"] in row["Content"] or row["Fruit"] in row["Content_1"]:
        print(row["ID"] + row["Fruit"])
It gives me TypeError: argument of type 'float' is not iterable
(The Pandas version I am using is v0.20.3.)
How can I achieve this? Thank you.
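In some rows, row["Content"] and row["Content_1"] are NaN, because the outer merge fills these columns with NaN wherever an ID has no matching ID_1. NaN is a float, and a float is not iterable, which is why the error occurs.
You can catch these cases with try/except: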
for index, row in result.iterrows():
    try:
        if row["ID"] == row["ID_1"] and row["Fruit"] in row["Content"] or row["Fruit"] in row["Content_1"]:
            print(str(row["ID"]) + row["Fruit"])
    except TypeError as e:
        print(e, "for:")
        print(row)
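The error can also be avoided rather than caught. The following is a minimal sketch, assuming the intent is "the IDs match and the fruit appears in either content column". The helper name `contains_fruit` is hypothetical, and the two-row frame is made-up illustration data shaped like the merged result, not the full data set:

```python
import numpy as np
import pandas as pd

# NaN is a float, so a membership test against it raises TypeError.
raised = False
try:
    "Avocado" in np.nan
except TypeError:
    raised = True

def contains_fruit(fruit, text):
    """Hypothetical helper: True only if both values are strings and fruit occurs in text."""
    return isinstance(fruit, str) and isinstance(text, str) and fruit in text

# Small illustrative frame shaped like the merged result (assumed data).
result = pd.DataFrame({
    "ID": ["4589", "14805"],
    "Fruit": ["Avocado", "Blackberry"],
    "ID_1": [np.nan, "14805"],
    "Content": [np.nan, "this is Blackberry"],
    "Content_1": [np.nan, "Kaomianjin"],
})

matches = []
for index, row in result.iterrows():
    # The parentheses make the ID test apply to both content checks;
    # without them, `and` binds tighter than `or`.
    if row["ID"] == row["ID_1"] and (
        contains_fruit(row["Fruit"], row["Content"])
        or contains_fruit(row["Fruit"], row["Content_1"])
    ):
        matches.append(row["ID"] + " " + row["Fruit"])

print(matches)
```

Because `and` binds more tightly than `or`, the unparenthesized condition evaluates the `Content_1` check even when the IDs differ, which is where the NaN rows trigger the error.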
I think your merge works fine. To get the output you specified, just add a Matched column that checks for NaN values:
import numpy as np

result = df.merge(df_1, left_on = 'ID', right_on = 'ID_1', how = 'outer')
# "N" if any column in the row is NaN, otherwise "Y"
result["Matched"] = np.where(result.isnull().any(axis=1), "N", "Y")
result
      ID            Fruit   ID_1             Content                Content_1  Matched
0   4589          Avocado    NaN                 NaN                      NaN        N
1  14805       Blackberry  14805  this is Blackberry               Kaomianjin        Y
2  23591     Black Sapote  23591        Khara Beruin                  Lai fun        Y
3  47089  Fingered Citron    NaN                 NaN                      NaN        N
4  56251      Crab Apples  56251               Lapha                  Liangpi        Y
5  85964    Custard Apple  85964           Loha Sura  who wants Custard Apple        Y
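If the goal is only to flag which IDs found a partner, `merge` can record that directly through its `indicator` parameter. A hedged sketch on a cut-down version of the data follows. Like the `isnull` approach, it reports ID matches only and does not check whether the fruit appears in `Content` or `Content_1`:

```python
import numpy as np
import pandas as pd

# Reduced sample of the question's data, for illustration only.
df = pd.DataFrame({"ID": ["4589", "14805", "23591"],
                   "Fruit": ["Avocado", "Blackberry", "Black Sapote"]})
df_1 = pd.DataFrame({"ID_1": ["14805", "23591"],
                     "Content": ["this is Blackberry", "Khara Beruin"],
                     "Content_1": ["Kaomianjin", "Lai fun"]})

# indicator=True adds a "_merge" column: "left_only", "right_only" or "both".
result = df.merge(df_1, left_on="ID", right_on="ID_1", how="left", indicator=True)
result["Matched"] = np.where(result["_merge"] == "both", "Y", "N")
result = result.drop("_merge", axis=1)
print(result[["ID", "Fruit", "Matched"]])
```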
I think you need:
import numpy as np

# swap the DataFrames and use a left join
result = df_1.merge(df, left_on='ID_1', right_on='ID', how='left')
# drop NaNs and build a pattern; the non-capturing group makes the word
# boundaries apply to every fruit name, not just the first and last
pat = r'\b(?:{})\b'.format('|'.join(result["Fruit"].dropna()))
# boolean mask - the iterrows loop rewritten in a vectorized way;
# the inner parentheses apply the ID test to both content columns
mask = ((result["ID"] == result["ID_1"]) &
        (result["Content"].str.contains(pat, na=False) |
         result["Content_1"].str.contains(pat, na=False)))
# remove unnecessary columns
result = result.drop(['ID', 'Fruit'], axis=1)
# add indicator column
result['matched'] = np.where(mask, 'Y', '')
print(result)
      ID_1             Content                Content_1  matched
0      488         Kalo Beruin        Jook-sing noodles
1    14805  this is Blackberry               Kaomianjin        Y
2    23591        Khara Beruin                  Lai fun
3   470995          Khato Dosh                   Lamian
4    56251               Lapha                  Liangpi
5    85964           Loha Sura  who wants Custard Apple        Y
6     5268            Matichak                    Misua
7   322624        Miniket Rice             nana Coconut        Y
8   342225          Mou Beruin                   Damson        Y
9   380689             Moulata                    Paomo
10  480562       oh Goji Berry                    Ramen        Y
11    5623        purple Grape          Rice vermicelli        Y
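One detail worth checking when building a pattern like this: `|` has the lowest precedence in a regular expression, so without a group the `\b` word boundaries attach only to the first and last alternatives. A small sketch with made-up strings shows the difference. The use of `re.escape` is an extra precaution that is not part of the answer, in case a name contains regex metacharacters:

```python
import re
import pandas as pd

# Made-up sample values for illustration.
fruits = ["Grape", "Coconut", "Damson"]
text = pd.Series(["purple Grape", "Grapefruit juice", "Coconuts", "Damson", None])

# Ungrouped: \b attaches only to "Grape" (front) and "Damson" (back).
loose = r'\b{}\b'.format('|'.join(fruits))
# Grouped: every alternative must match as a whole word.
strict = r'\b(?:{})\b'.format('|'.join(re.escape(f) for f in fruits))

loose_hits = text.str.contains(loose, na=False).tolist()
strict_hits = text.str.contains(strict, na=False).tolist()
print(loose_hits)
print(strict_hits)
```

With the ungrouped pattern, "Grapefruit juice" and "Coconuts" count as hits; with the grouped pattern they do not.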
Old solution with an outer join:
import numpy as np

result = df.merge(df_1, left_on='ID', right_on='ID_1', how='outer')
# the non-capturing group applies the word boundaries to every fruit name
pat = r'\b(?:{})\b'.format('|'.join(result["Fruit"].dropna()))
# the inner parentheses apply the ID test to both content columns
mask = ((result["ID"] == result["ID_1"]) &
        (result["Content"].str.contains(pat, na=False) |
         result["Content_1"].str.contains(pat, na=False)))
result['matched'] = np.where(mask, 'Y', '')
print(result)
        ID            Fruit    ID_1             Content                Content_1  matched
0     4589          Avocado     NaN                 NaN                      NaN
1    14805       Blackberry   14805  this is Blackberry               Kaomianjin        Y
2    23591     Black Sapote   23591        Khara Beruin                  Lai fun
3    47089  Fingered Citron     NaN                 NaN                      NaN
4    56251      Crab Apples   56251               Lapha                  Liangpi
5    85964    Custard Apple   85964           Loha Sura  who wants Custard Apple        Y
6   235225      Chico Fruit     NaN                 NaN                      NaN
7   322624          Coconut  322624        Miniket Rice             nana Coconut        Y
8   342225           Damson  342225          Mou Beruin                   Damson        Y
9   380689       Elderberry  380689             Moulata                    Paomo
10  480562       Goji Berry  480562       oh Goji Berry                    Ramen        Y
11    5623            Grape    5623        purple Grape          Rice vermicelli        Y
12   85624            Guava     NaN                 NaN                      NaN
13  866278      Huckleberry     NaN                 NaN                      NaN
14     NaN              NaN     488         Kalo Beruin        Jook-sing noodles
15     NaN              NaN  470995          Khato Dosh                   Lamian
16     NaN              NaN    5268            Matichak                    Misua
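Both versions above test each content cell against a pattern built from all fruit names, which gives the expected result for this data. If the check should be strictly against the fruit on the same row, as the question describes, a plain comprehension is one option. The following is a sketch on a reduced data set. The helper `row_match` is a hypothetical name, and the "Avocado toast" value is made up to show a case where the two approaches would differ:

```python
import pandas as pd

df = pd.DataFrame({"ID": ["4589", "14805", "23591", "85964"],
                   "Fruit": ["Avocado", "Blackberry", "Black Sapote", "Custard Apple"]})
# "Avocado toast" is an assumed value: it names a fruit, but not the fruit of ID 23591.
df_1 = pd.DataFrame({"ID_1": ["14805", "23591", "85964"],
                     "Content": ["this is Blackberry", "Avocado toast", "Loha Sura"],
                     "Content_1": ["Kaomianjin", "Lai fun", "who wants Custard Apple"]})

result = df.merge(df_1, left_on="ID", right_on="ID_1", how="left")

def row_match(fruit, *texts):
    """Hypothetical helper: is this row's fruit a substring of any string cell?"""
    return any(isinstance(t, str) and fruit in t for t in texts)

# After a left join from df, Fruit is never NaN; rows without an ID match
# have NaN content, which row_match treats as "no match".
result["matched"] = ["Y" if row_match(f, c, c1) else "N"
                     for f, c, c1 in zip(result["Fruit"], result["Content"], result["Content_1"])]
print(result[["ID", "Fruit", "matched"]])
```

Here ID 23591 is marked "N" because its own fruit, Black Sapote, does not appear in its content, even though another fruit name does.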