best way to match one column in dataframe to multiple columns in another dataframe
How can I look through 3 different columns to match a common number with one column of another dataframe to merge in the data (and if no match append)?
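I am having a hard time figuring out how to merge df with df1 into the final output (see below). I have tried pd.merge on each ID column separately, 3 times in total. Before each pd.merge, I renamed the header column to ID1, ID2 or ID3. Finally, if all values in the ID columns were NaN after the attempted merges, I appended that row of data. I am wondering whether there is a simpler way to do this.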
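EDIT: One rule is that you cannot merge on the "Account" field. In my actual data, the Account field is sometimes slightly different between the two dataframes, so I have to merge on the ID fields!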
df:
  Account  ID1  ID2  ID3  Revenue
0       A  123  789  567      900
1       B  321  234  213      400
2       C       456           700
df1:
  Account Industry   ID
0       A     Tech  123
1       B   Retail  213
2       D    Legal  111
output:
  Account Industry   ID  Revenue
0       A     Tech  123      900
1       B   Retail  213      400
2       C           456      700
3       D    Legal  111        0
Use:
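import numpy as np
import pandas as pd

# step 1a: stack ID1, ID2 and ID3 into a single ID column
df2 = df.melt(id_vars=['Account', 'Revenue'], value_name='ID').drop(columns='variable')
# step 1b (edited in by David Erickson, the OP): I needed the column to be a string in order
# to merge. I also had to have NaNs for step 4, so that it properly brings in the ID for Account C.
# With the numeric sample frames above, skip this line: astype(str) turns NaN into the
# string 'nan', and the key would no longer match the integer ID column of df1.
df2['ID'] = df2['ID'].astype(str).replace('', np.nan, regex=True)
# step 2: outer merge on ID, then drop the rows that have no ID
df3 = pd.merge(df1, df2, on='ID', how='outer').dropna(subset=['ID'])
# step 3: where df1 supplied no account name, take the one from df
df3['Account_x'] = df3['Account_x'].fillna(df3.pop('Account_y'))
# step 4: keep the first row per account, then rename, sort and reindex
df3 = (
    df3.drop_duplicates(subset=['Account_x'])
    .rename({'Account_x': 'Account'}, axis=1)
    .sort_values(by='Account')
    .reset_index(drop=True)
)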
Steps:
# step 1: df2
  Account  Revenue     ID
0       A      900  123.0
1       B      400  321.0
2       C      700    NaN
3       A      900  789.0
4       B      400  234.0
5       C      700  456.0
6       A      900  567.0
7       B      400  213.0
8       C      700    NaN
# step 2: df3
  Account_x Industry     ID Account_y  Revenue
0         A     Tech  123.0         A    900.0
1         B   Retail  213.0         B    400.0
2         D    Legal  111.0       NaN      NaN
3       NaN      NaN  321.0         B    400.0
6       NaN      NaN  789.0         A    900.0
7       NaN      NaN  234.0         B    400.0
8       NaN      NaN  456.0         C    700.0
9       NaN      NaN  567.0         A    900.0
# step 3: df3
  Account_x Industry     ID  Revenue
0         A     Tech  123.0    900.0
1         B   Retail  213.0    400.0
2         D    Legal  111.0      NaN
3         B      NaN  321.0    400.0
6         A      NaN  789.0    900.0
7         B      NaN  234.0    400.0
8         C      NaN  456.0    700.0
9         A      NaN  567.0    900.0
# step 4: df3
  Account Industry     ID  Revenue
0       A     Tech  123.0    900.0
1       B   Retail  213.0    400.0
2       C      NaN  456.0    700.0
3       D    Legal  111.0      NaN
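The four steps can also be wrapped in one function. What follows is only a sketch under assumptions, not part of the original answer: the function name `merge_on_any_id` and the suffixes `_df1` / `_df` are made up for illustration, the IDs are assumed to be numeric with NaN for missing values (as in the sample frames), and instead of relying on row order for `drop_duplicates` it uses the `indicator` option of `pd.merge` to decide which rows to keep.

```python
import numpy as np
import pandas as pd


def merge_on_any_id(df, df1, id_cols=("ID1", "ID2", "ID3")):
    """Hypothetical helper: match df1['ID'] against any ID column of df."""
    # Step 1: stack the ID columns into one 'ID' column and drop missing IDs.
    long = (
        df.melt(id_vars=["Account", "Revenue"], value_vars=list(id_cols), value_name="ID")
        .drop(columns="variable")
        .dropna(subset=["ID"])
    )
    # Step 2: outer merge on ID only (never on Account, per the question's rule).
    merged = pd.merge(
        df1, long, on="ID", how="outer", suffixes=("_df1", "_df"), indicator=True
    )
    both = merged[merged["_merge"] == "both"]            # ID found in both frames
    only_df1 = merged[merged["_merge"] == "left_only"]   # df1 rows without a partner
    # Accounts of df for which none of the IDs matched: keep one row each.
    only_df = merged[
        (merged["_merge"] == "right_only")
        & ~merged["Account_df"].isin(both["Account_df"])
    ].drop_duplicates(subset="Account_df")
    result = pd.concat([both, only_df1, only_df], ignore_index=True)
    # Step 3: prefer df1's account name, fall back to the one from df.
    result["Account"] = result["Account_df1"].fillna(result["Account_df"])
    # Step 4: tidy up columns and row order.
    result = result[["Account", "Industry", "ID", "Revenue"]]
    return result.sort_values("Account").reset_index(drop=True)


# Sample frames from the question (np.nan marks the blank IDs).
df = pd.DataFrame(
    [["A", 123, 789, 567, 900], ["B", 321, 234, 213, 400], ["C", np.nan, 456, np.nan, 700]],
    columns=["Account", "ID1", "ID2", "ID3", "Revenue"],
)
df1 = pd.DataFrame(
    [["A", "Tech", 123], ["B", "Retail", 213], ["D", "Legal", 111]],
    columns=["Account", "Industry", "ID"],
)
result = merge_on_any_id(df, df1)
print(result)
```

Selecting on the df-side account name means that an account whose name is spelled slightly differently in the two frames is still treated as matched once any of its IDs is found in df1.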
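You can put the ID1, ID2 and ID3 columns into a single ID column, duplicating the Account and Revenue entries for each ID.
After that, you can perform a left join on the two dataframes.
Edit, for the code part: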
import pandas as pd

df = pd.DataFrame([
    ['A', 123, 789, 567, 900],
    ['B', 321, 234, 213, 400],
    ['C', None, 456, None, 700]
], columns=['Account', 'ID1', 'ID2', 'ID3', 'Revenue'])
df1 = pd.DataFrame([
    ['A', 'Tech', 123],
    ['B', 'Retail', 213],
    ['D', 'Legal', 111]
], columns=['Account', 'Industry', 'ID'])

# Stack ID1, ID2 and ID3 into one ID column, repeating Account and Revenue.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
df_new = pd.concat(
    [df[['Account', ix, 'Revenue']].rename(columns={ix: 'ID'})
     for ix in ['ID1', 'ID2', 'ID3']],
    ignore_index=True)
df_new = df_new.dropna()
df_new['ID'] = df_new['ID'].astype(int)

df_new.set_index('ID', inplace=True)
df1.set_index('ID', inplace=True)
# lsuffix is applied to the left frame, so df1's Account becomes Account_from_df1
output = df1.join(df_new, how='left', lsuffix='_from_df1')

# Append the accounts of df that matched no ID in df1
missing_accounts = set(df['Account'].unique()) - set(output['Account_from_df1'].unique())
output = pd.concat([output, df_new[df_new['Account'].isin(missing_accounts)]])

# Where df supplied no account name, fall back to the one from df1
output['Account'] = output['Account'].fillna(output['Account_from_df1'])
output = output.drop(columns=['Account_from_df1']).reset_index()
print(output)
Output:
    ID Account Industry  Revenue
0  123       A     Tech      900
1  213       B   Retail      400
2  111       D    Legal      NaN
3  456       C      NaN      700
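In the output above, Revenue is NaN for account D and Industry is NaN for account C, while the desired output in the question shows 0 and a blank. A minimal sketch of that final clean-up follows. It assumes the merged frame looks like the Output table, which is rebuilt by hand here so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the merged result shown above (an assumption for illustration).
output = pd.DataFrame({
    "ID": [123, 213, 111, 456],
    "Account": ["A", "B", "D", "C"],
    "Industry": ["Tech", "Retail", "Legal", np.nan],
    "Revenue": [900, 400, np.nan, 700],
})

# Accounts without a match have no revenue: show 0 and restore the integer dtype.
output["Revenue"] = output["Revenue"].fillna(0).astype(int)
# Accounts without a match have no industry: show an empty string instead of NaN.
output["Industry"] = output["Industry"].fillna("")
# Order the rows by account and the columns as in the desired output.
output = output.sort_values("Account").reset_index(drop=True)
output = output[["Account", "Industry", "ID", "Revenue"]]
print(output)
```

Filling with 0 is a choice taken from the question's desired output; if a missing revenue should stay distinguishable from a real zero, leave the NaN in place.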