Pandas: merge values from one dataframe into another based on a condition

Question

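Using fuzzy logic and the fuzzywuzzy module, I am able to match names (from one dataframe) with short names (from another dataframe). Both dataframes also contain an ISIN column.

This is the dataframe I get after applying the logic.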

ISIN                                      Name Currency         Value  % Weight  Asset Type Comments/ Assumptions          matches
236   NaN            Partnerre Ltd 4.875% Perp Sr:J      USD  1.684069e+05    0.0004         NaN                   NaN
237   NaN  Berkley (Wr) Corporation 5.700% 03/30/58      USD  6.955837e+04    0.0002         NaN                   NaN
238   NaN             Tc Energy Corp Flt Perp Sr:11      USD  6.380262e+04    0.0001         NaN                   NaN   TC ENERGY CORP
239   NaN                      Cash and Equivalents      USD  2.166579e+07    0.0499         NaN                   NaN
240   NaN                                       AUM      NaN  4.338766e+08    0.9999         NaN                   NaN  AUM IND BARC US

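A new column, "matches", is created. It holds the short name from the second dataframe that matched the name from the first dataframe.

The ISIN in dataframe1 is empty, while the ISIN in dataframe2 is present. For each successful match (Name from the first dataframe against Short Name from the second), I want to add the relevant ISIN from the second dataframe to the first dataframe.

How do I get the ISIN from the second dataframe into the first so that my final output looks like this?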

ISIN                                      Name Currency         Value  % Weight  Asset Type Comments/ Assumptions          matches
236   NaN            Partnerre Ltd 4.875% Perp Sr:J      USD  1.684069e+05    0.0004         NaN                   NaN
237   NaN  Berkley (Wr) Corporation 5.700% 03/30/58      USD  6.955837e+04    0.0002         NaN                   NaN
238   78s9             Tc Energy Corp Flt Perp Sr:11      USD  6.380262e+04    0.0001         NaN                   NaN   TC ENERGY CORP
239   NaN                      Cash and Equivalents      USD  2.166579e+07    0.0499         NaN                   NaN
240   123e                                       AUM      NaN  4.338766e+08    0.9999         NaN                   NaN  AUM IND BARC US

Edit: the dataframes in their original form

df1

ISIN                                 Name Currency       Value  % Weight  Asset Type                              Comments/ Assumptions
0   NaN     Transcanada Trust 5.875 08/15/76      USD  7616765.00    0.0176         NaN  https://assets.cohenandsteers.com/assets/conte...
1   NaN      Bp Capital Markets Plc Flt Perp      USD  7348570.50    0.0169         NaN  Holding value for each constituent is derived ...
2   NaN       Transcanada Trust Flt 09/15/79      USD  7341250.00    0.0169         NaN                                                NaN
3   NaN      Bp Capital Markets Plc Flt Perp      USD  6734022.32    0.0155         NaN                                                NaN
4   NaN  Prudential Financial 5.375% 5/15/45      USD  6508290.68    0.0150         NaN                                                NaN
(241, 7)

df2

Short Name          ISIN
0  ABU DHABI COMMER  AEA000201011
1  ABU DHABI NATION  AEA002401015
2  ABU DHABI NATION  AEA006101017
3  ADNOC DRILLING C  AEA007301012
4  ALPHA DHABI HOLD  AEA007601015
(66987, 2)

Edit 2: the fuzzy logic used to get the matches from the dataframes

import pandas as pd
from fuzzywuzzy import fuzz, process

df1 = pd.read_excel('file.xlsx', sheet_name=1, usecols=[1, 2, 3, 4, 5, 6, 8], header=1)
df2 = pd.read_excel("Excel files/file2.xlsx", sheet_name=0, usecols=[1, 2], header=1)

# empty lists for storing the matches
# later
mat1 = []
mat2 = []
p = []

# converting dataframe column
# to list of elements
# to do fuzzy matching
list1 = df1['Name'].tolist()
list2 = df2['Short Name'].tolist()

# taking the threshold as 93
threshold = 93

# iterating through list1 to extract
# its closest match from list2
for i in list1:
    mat1.append(process.extractOne(i, list2, scorer=fuzz.token_set_ratio))
df1['matches'] = mat1

# iterating through the closest matches
# to filter out the maximum closest match
for j in df1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []

# storing the resultant matches back
# to df1
df1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using token_set_ratio():")
#print(df1.to_csv('todays-result1.csv'))
print(df1.head(20))

Answer 1

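Assuming the ISIN column in your first dataframe is entirely null, a simple merge will do what you need. If you need to keep the non-null ISINs in the first dataframe, you will need to use a boolean mask: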

import pandas as pd

df1 = pd.DataFrame(
  [[None, "Apple", "appl"], 
  [None, "Google", "ggl"], 
  [None, "Amazon", 'amzn']], 
  columns=["ISIN", "Name", "matches"]
)

df2 = pd.DataFrame(
  [["ISIN1", "appl"], 
  ["ISIN2", "ggl"]], 
  columns= ["ISIN", "Short Name"]
)

missing_isin = df1['ISIN'].isnull()

df1.loc[missing_isin, 'ISIN'] = df1.loc[missing_isin][['matches']].merge(
    df2[['ISIN', 'Short Name']], 
    how='left', 
    left_on='matches', 
    right_on='Short Name'
)['ISIN']

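left_on / right_on: the names of the columns to match on in the left and right dataframes.

how='left': (simply put) keeps every row of the left dataframe, in its original order, and leaves NaN where there is no match. Note that a merge on columns returns a fresh 0..n-1 index rather than the left dataframe's index. See the documentation for more information.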
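
One caveat with the merge-based assignment above: it lines up with df1 only when df1 has the default 0..n-1 index and every short name appears once in df2. In the question's df2, "ABU DHABI NATION" appears twice with different ISINs, so a merge would produce extra rows. A minimal sketch of an alternative, assuming it is acceptable to keep the first ISIN for each short name (that tie-break rule and the toy values below are illustrative assumptions, not from the question), uses Series.map so the result keeps df1's own index:

```python
import pandas as pd

# Toy data; the ISIN values and names are made up for illustration.
df1 = pd.DataFrame(
    {
        "ISIN": [None, "KEEP0001", None, None],
        "Name": ["Apple", "Google", "Amazon", "Cash"],
        "matches": ["appl", "ggl", "amzn", ""],  # "" means no fuzzy match
    },
    index=[10, 11, 12, 13],  # deliberately not 0..n-1
)

df2 = pd.DataFrame(
    {
        "ISIN": ["ISIN1", "ISIN2", "ISIN3", "ISIN4"],
        "Short Name": ["appl", "ggl", "amzn", "amzn"],  # "amzn" appears twice
    }
)

# One ISIN per short name: keep the first occurrence (an assumption;
# pick whatever tie-break rule suits your data).
lookup = df2.drop_duplicates("Short Name").set_index("Short Name")["ISIN"]

# Fill only the rows whose ISIN is missing. Series.map keeps df1's index,
# so the assignment aligns row by row.
missing_isin = df1["ISIN"].isnull()
df1.loc[missing_isin, "ISIN"] = df1.loc[missing_isin, "matches"].map(lookup)

print(df1)
```

Rows that already had an ISIN are left alone, and rows whose "matches" value is not found in df2 stay null.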

Answer 1
1 accepted 2021-12-17 10:49:43
