Faster way to get row data (based on a condition) from one dataframe and merge onto another in pandas (Python)
Pandas: Merge values from one dataframe to another based on condition
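Using fuzzy logic and the fuzzywuzzy module, I was able to match names (from one dataframe) to short names (from another dataframe). Both dataframes also contain an ISIN column.
This is the dataframe I get after applying the logic.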
ISIN Name Currency Value % Weight Asset Type Comments/ Assumptions matches
236 NaN Partnerre Ltd 4.875% Perp Sr:J USD 1.684069e+05 0.0004 NaN NaN
237 NaN Berkley (Wr) Corporation 5.700% 03/30/58 USD 6.955837e+04 0.0002 NaN NaN
238 NaN Tc Energy Corp Flt Perp Sr:11 USD 6.380262e+04 0.0001 NaN NaN TC ENERGY CORP
239 NaN Cash and Equivalents USD 2.166579e+07 0.0499 NaN NaN
240 NaN AUM NaN 4.338766e+08 0.9999 NaN NaN AUM IND BARC US
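A new column, "matches", was created. A value in it means that a short name from the second dataframe matched the name from the first dataframe.
The ISIN in dataframe1 is empty, while the ISIN in dataframe2 is present. Wherever there is a match (between the Name in the first dataframe and the Short Name in the second), I want to add the corresponding ISIN from the second dataframe to the first dataframe.
How can I get the ISIN from the second dataframe into the first so that my final output looks like this?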
ISIN Name Currency Value % Weight Asset Type Comments/ Assumptions matches
236 NaN Partnerre Ltd 4.875% Perp Sr:J USD 1.684069e+05 0.0004 NaN NaN
237 NaN Berkley (Wr) Corporation 5.700% 03/30/58 USD 6.955837e+04 0.0002 NaN NaN
238 78s9 Tc Energy Corp Flt Perp Sr:11 USD 6.380262e+04 0.0001 NaN NaN TC ENERGY CORP
239 NaN Cash and Equivalents USD 2.166579e+07 0.0499 NaN NaN
240 123e AUM NaN 4.338766e+08 0.9999 NaN NaN AUM IND BARC US
Edit: the dataframes in their original form
df1
ISIN Name Currency Value % Weight Asset Type Comments/ Assumptions
0 NaN Transcanada Trust 5.875 08/15/76 USD 7616765.00 0.0176 NaN https://assets.cohenandsteers.com/assets/conte...
1 NaN Bp Capital Markets Plc Flt Perp USD 7348570.50 0.0169 NaN Holding value for each constituent is derived ...
2 NaN Transcanada Trust Flt 09/15/79 USD 7341250.00 0.0169 NaN NaN
3 NaN Bp Capital Markets Plc Flt Perp USD 6734022.32 0.0155 NaN NaN
4 NaN Prudential Financial 5.375% 5/15/45 USD 6508290.68 0.0150 NaN NaN
(241, 7)
df2
Short Name ISIN
0 ABU DHABI COMMER AEA000201011
1 ABU DHABI NATION AEA002401015
2 ABU DHABI NATION AEA006101017
3 ADNOC DRILLING C AEA007301012
4 ALPHA DHABI HOLD AEA007601015
(66987, 2)
Edit 2: the fuzzy logic used to get the matches from the dataframes
import pandas as pd
from fuzzywuzzy import fuzz, process

df1 = pd.read_excel('file.xlsx', sheet_name=1, usecols=[1, 2, 3, 4, 5, 6, 8], header=1)
df2 = pd.read_excel("Excel files/file2.xlsx", sheet_name=0, usecols=[1, 2], header=1)
# empty lists for storing the matches
# later
mat1 = []
mat2 = []
p = []
# converting dataframe column
# to list of elements
# to do fuzzy matching
list1 = df1['Name'].tolist()
list2 = df2['Short Name'].tolist()
# taking the threshold as 93
threshold = 93
# iterating through list1 to extract
# its closest match from list2
for i in list1:
    mat1.append(process.extractOne(i, list2, scorer=fuzz.token_set_ratio))
df1['matches'] = mat1
# iterating through the closest matches
# to keep only those that reach the threshold
for j in df1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []
# storing the resultant matches back
# to df1
df1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using token_set_ratio():")
#print(df1.to_csv('todays-result1.csv'))
print(df1.head(20))
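The threshold step in the second loop above can be read as a small helper that keeps a match only when its score is high enough. This is a minimal sketch, not the asker's code: `keep_if_confident` is a hypothetical name, and the scores in `sample_results` are made up for illustration. It assumes each result is a `(match, score)` pair, which is what `process.extractOne` returns when the choices are a plain list.

```python
def keep_if_confident(result, threshold=93):
    """Return the matched string if its score reaches the threshold, else ''."""
    match, score = result[0], result[1]
    return match if score >= threshold else ""


# Hypothetical (match, score) pairs; the scores are illustrative only.
sample_results = [("TC ENERGY CORP", 95), ("ABU DHABI COMMER", 40)]

filtered = [keep_if_confident(r) for r in sample_results]
print(filtered)
```

Rows below the threshold end up with an empty string in "matches", which is why most rows in the output above show nothing in that column.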
If the ISIN column in your first dataframe is entirely null, a simple merge will do what you need. If you need to preserve non-null ISINs in the first dataframe, you need to use a boolean mask:
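import pandas as pd

df1 = pd.DataFrame(
    [[None, "Apple", "appl"],
     [None, "Google", "ggl"],
     [None, "Amazon", 'amzn']],
    columns=["ISIN", "Name", "matches"]
)
df2 = pd.DataFrame(
    [["ISIN1", "appl"],
     ["ISIN2", "ggl"]],
    columns=["ISIN", "Short Name"]
)
# True for the rows of df1 that still need an ISIN
missing_isin = df1['ISIN'].isnull()
# merge() returns a fresh 0..n-1 index, so assign by position with to_numpy();
# this assumes 'Short Name' is unique in df2, so the row count is unchanged
df1.loc[missing_isin, 'ISIN'] = df1.loc[missing_isin, ['matches']].merge(
    df2[['ISIN', 'Short Name']],
    how='left',
    left_on='matches',
    right_on='Short Name'
)['ISIN'].to_numpy()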
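For the first case mentioned above, where every ISIN in the first dataframe is null, the plain merge might look like the following. This is only a sketch on the same toy data; `merged` is a name introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[None, "Apple", "appl"],
     [None, "Google", "ggl"],
     [None, "Amazon", "amzn"]],
    columns=["ISIN", "Name", "matches"],
)
df2 = pd.DataFrame(
    [["ISIN1", "appl"],
     ["ISIN2", "ggl"]],
    columns=["ISIN", "Short Name"],
)

# Drop the all-null ISIN column, then bring ISIN in from df2 with a left merge.
merged = (
    df1.drop(columns="ISIN")
       .merge(df2, how="left", left_on="matches", right_on="Short Name")
       .drop(columns="Short Name")
)
print(merged)
```

Note that ISIN ends up as the last column here, and rows without a match (Amazon in this toy data) keep a missing ISIN.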
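left_on / right_on
:- the column names to match on in the left and right dataframes
how='left'
:- (in simple terms) keeps every row of the left dataframe, in its original order; see the pandas documentation for more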
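As an alternative to the masked merge, the same fill can be sketched with a lookup Series and `fillna`, which aligns on the index of the first dataframe and so does not depend on row positions. This is an assumption-laden sketch rather than part of the answer: the pre-filled ISIN "KEEP1" and the custom index are hypothetical, added only to show that existing values and non-default indexes survive.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["KEEP1", "Apple", "appl"],   # hypothetical pre-filled ISIN
     [None, "Google", "ggl"],
     [None, "Amazon", "amzn"]],
    columns=["ISIN", "Name", "matches"],
    index=[10, 11, 12],            # non-default index, for illustration
)
df2 = pd.DataFrame(
    [["ISIN1", "appl"],
     ["ISIN2", "ggl"]],
    columns=["ISIN", "Short Name"],
)

# Build a Short Name -> ISIN lookup; keep the first ISIN if a name repeats.
lookup = df2.drop_duplicates("Short Name").set_index("Short Name")["ISIN"]

# Fill only the missing ISINs; existing values are left untouched.
df1["ISIN"] = df1["ISIN"].fillna(df1["matches"].map(lookup))
print(df1)
```

Keeping only the first ISIN per short name is a design choice made for this sketch; if short names repeat in the real data, deciding which ISIN wins needs its own rule.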