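How can I check 3 different columns to match a common number against one column of another dataframe in order to merge the data (and append where there is no match)?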

Question

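I'm having a hard time working out how to merge df with df1 into the final output (see below). I have tried running pd.merge once on each ID column, 3 times in total. Before each pd.merge, I rename the ID header to ID1, ID2 or ID3. Finally, if all the values in the ID columns are NaN after the merge attempts, I append that row of data. I'm wondering whether there is an easier way to do this.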

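EDIT: One rule is that you cannot merge on the "Account" field. In my actual data, the Account field is sometimes slightly different between the two dataframes, so I have to merge on the ID fields!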

df:

    Account ID1 ID2 ID3 Revenue
0   A       123 789 567 900
1   B       321 234 213 400
2   C           456     700

df1:

    Account Industry    ID
0   A       Tech        123
1   B       Retail      213
2   D       Legal       111

output:

    Account Industry    ID     Revenue
0   A       Tech        123    900
1   B       Retail      213    400
2   C                   456    700
3   D       Legal       111    0
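
To make the matching rule concrete, here is a minimal sketch that rebuilds the two sample frames from the tables above and checks which rows have an ID in common. It assumes the blank cells in df are missing values; the names `id_cols`, `df_has_match` and `df1_has_match` are made up for illustration.

```python
import pandas as pd

# Sample frames rebuilt from the tables above; blank IDs are assumed to be missing values
df = pd.DataFrame(
    [["A", 123, 789, 567, 900], ["B", 321, 234, 213, 400], ["C", None, 456, None, 700]],
    columns=["Account", "ID1", "ID2", "ID3", "Revenue"],
)
df1 = pd.DataFrame(
    [["A", "Tech", 123], ["B", "Retail", 213], ["D", "Legal", 111]],
    columns=["Account", "Industry", "ID"],
)

id_cols = ["ID1", "ID2", "ID3"]

# True where a row of df has at least one ID that also appears in df1["ID"]
df_has_match = df[id_cols].isin(df1["ID"].tolist()).any(axis=1)

# True where a df1 ID appears in any of df's three ID columns
df1_has_match = df1["ID"].isin(df[id_cols].to_numpy().ravel())

print(df.loc[~df_has_match, "Account"].tolist())    # accounts found only in df
print(df1.loc[~df1_has_match, "Account"].tolist())  # accounts found only in df1
```

The unmatched rows on either side are the ones that have to be appended rather than merged.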

Answer 1

Use:

import numpy as np
import pandas as pd

# step 1a: stack ID1, ID2 and ID3 into a single ID column
df2 = df.melt(id_vars=['Account', 'Revenue'], value_name='ID').drop(columns='variable')
# step 1b (edited by David Erickson (OP)): I needed the column to be a string in order to merge. I also had to have NaNs for step 4, so that it properly brings in the ID for Account C.
# Empty-string IDs become NaN here
df2['ID'] = df2['ID'].astype(str).replace('', np.nan)

# step 2
df3 = pd.merge(df1, df2, on='ID', how='outer').dropna(subset=['ID'])

# step 3
df3['Account_x'] = df3['Account_x'].fillna(df3.pop('Account_y'))

# step 4
df3 = (
    df3.drop_duplicates(subset=['Account_x'])
    .rename({'Account_x': 'Account'}, axis=1)
    .sort_values(by='Account')
    .reset_index(drop=True)
)

Steps:

# step 1: df2
  Account  Revenue     ID
0       A      900  123.0
1       B      400  321.0
2       C      700    NaN
3       A      900  789.0
4       B      400  234.0
5       C      700  456.0
6       A      900  567.0
7       B      400  213.0
8       C      700    NaN

# step 2: df3
  Account_x Industry     ID Account_y  Revenue
0         A     Tech  123.0         A    900.0
1         B   Retail  213.0         B    400.0
2         D    Legal  111.0       NaN      NaN
3       NaN      NaN  321.0         B    400.0
6       NaN      NaN  789.0         A    900.0
7       NaN      NaN  234.0         B    400.0
8       NaN      NaN  456.0         C    700.0
9       NaN      NaN  567.0         A    900.0

# step 3: df3
  Account_x Industry     ID  Revenue
0         A     Tech  123.0    900.0
1         B   Retail  213.0    400.0
2         D    Legal  111.0      NaN
3         B      NaN  321.0    400.0
6         A      NaN  789.0    900.0
7         B      NaN  234.0    400.0
8         C      NaN  456.0    700.0
9         A      NaN  567.0    900.0


# step 4: df3
  Account Industry     ID  Revenue
0       A     Tech  123.0    900.0
1       B   Retail  213.0    400.0
2       C      NaN  456.0    700.0
3       D    Legal  111.0      NaN
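
For reference, here is a self-contained sketch of the same melt plus outer-merge idea that also fills the missing Revenue with 0, as in the output the question asks for. It is a variant under assumptions, not the code above: blank IDs are taken to be missing values, duplicates are removed per source row of df instead of per Account name (so slightly different Account spellings would not matter), and the helper columns `row` and `matched` are made up for illustration.

```python
import pandas as pd

# Sample frames rebuilt from the question; blank IDs are assumed to be missing values
df = pd.DataFrame(
    [["A", 123, 789, 567, 900], ["B", 321, 234, 213, 400], ["C", None, 456, None, 700]],
    columns=["Account", "ID1", "ID2", "ID3", "Revenue"],
)
df1 = pd.DataFrame(
    [["A", "Tech", 123], ["B", "Retail", 213], ["D", "Legal", 111]],
    columns=["Account", "Industry", "ID"],
)

# Step 1: long format, one row per non-blank ID; "row" remembers the source row of df
long_df = (
    df.rename_axis("row").reset_index()
    .melt(id_vars=["row", "Account", "Revenue"], value_name="ID")
    .dropna(subset=["ID"])
    .astype({"ID": int})
)

# Step 2: outer merge on ID; indicator=True adds a "_merge" column with the row's origin
merged = df1.merge(long_df, on="ID", how="outer", suffixes=("", "_df"), indicator=True)

# Step 3: prefer df1's Account, fall back to df's Account
merged["Account"] = merged["Account"].fillna(merged.pop("Account_df"))

# Step 4: per source row of df keep the matched ID if there is one, otherwise the
# first non-blank ID in column order; rows that exist only in df1 are always kept
merged["matched"] = merged["_merge"] == "both"
ordered = merged.sort_values(["matched", "variable"], ascending=[False, True])
keep = (ordered["_merge"] == "left_only") | ~ordered.duplicated(subset="row")

result = (
    ordered.loc[keep, ["Account", "Industry", "ID", "Revenue"]]
    .fillna({"Revenue": 0})
    .astype({"Revenue": int})
    .sort_values("Account")
    .reset_index(drop=True)
)
print(result)
```

Sorting the matched rows first before dropping duplicates avoids depending on the row order that the outer merge happens to return.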

Answer 2

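You can put the ID1, ID2 and ID3 columns into a single ID column, duplicating the Account and Revenue entries for each ID.

After that, you can perform a left join on the two dataframes.

EDIT: for the code part: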
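
A minimal sketch of this idea on the sample data, assuming blank IDs are missing values (the names `stacked` and `joined` are only illustrative). Note that a left join starting from df1 keeps only the rows of df1, so an account such as C that has no matching ID still has to be added separately.

```python
import pandas as pd

# Sample frames rebuilt from the question; blank IDs are assumed to be missing values
df = pd.DataFrame(
    [["A", 123, 789, 567, 900], ["B", 321, 234, 213, 400], ["C", None, 456, None, 700]],
    columns=["Account", "ID1", "ID2", "ID3", "Revenue"],
)
df1 = pd.DataFrame(
    [["A", "Tech", 123], ["B", "Retail", 213], ["D", "Legal", 111]],
    columns=["Account", "Industry", "ID"],
)

# Stack ID1..ID3 into one ID column, repeating Account and Revenue for each ID
stacked = (
    df.melt(id_vars=["Account", "Revenue"], value_name="ID")
    .dropna(subset=["ID"])
    .astype({"ID": int})
)

# Left join from df1: every row of df1 is kept, rows of df without a matching ID are not
joined = df1.merge(stacked[["ID", "Revenue"]], on="ID", how="left")
print(joined)
```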

import pandas as pd
import numpy as np 

df = pd.DataFrame([
    ['A', 123, 789, 567, 900],
    ['B', 321, 234, 213, 400],
    ['C', None, 456, None, 700]
], columns=['Account', 'ID1', 'ID2', 'ID3', 'Revenue'])
df1 = pd.DataFrame([
    ['A', 'Tech', 123],
    ['B', 'Retail', 213],
    ['D', 'Legal', 111]
], columns = ['Account', 'Industry', 'ID'])

# Stack ID1, ID2 and ID3 into a single ID column, repeating Account and Revenue
df_new = pd.concat(
    [df[['Account', ix, 'Revenue']].rename(columns={ix: 'ID'}) for ix in ['ID1', 'ID2', 'ID3']]
)
df_new = df_new.dropna()
df_new['ID'] = df_new['ID'].astype(int)
df_new.set_index('ID', inplace=True)
df1.set_index('ID', inplace=True)

# lsuffix renames df1's own Account column, so 'Account_from_df_new' holds df1's value
output = df1.join(df_new, how='left', lsuffix='_from_df_new')
missing_accounts = set(df['Account'].unique()) - set(output['Account_from_df_new'].unique())
# Add the rows of accounts that have no matching ID in df1
output = pd.concat([output, df_new[df_new['Account'].isin(missing_accounts)]])
# Prefer the Account joined from df_new, falling back to df1's Account
output['Account'] = output['Account'].fillna(output['Account_from_df_new'])

output.drop(columns=['Account_from_df_new']).reset_index()

Output:

    ID Account Industry Revenue
0  123       A     Tech     900
1  213       B   Retail     400
2  111       D    Legal     NaN
3  456       C      NaN     700


Answer 1
2 Accepted 2020-06-18 09:26:51

Answer 2
1 2020-06-18 07:55:27
