Join 2 data sets and compare in Python
Can someone help me solve the following problem with Python code?
File A
ID, items, Amount
A1, 10, 100
A2, 20, 200
A3, 30, 300
File B
ID, items, Amount
A1, 10, 100
A2, 12, 120
A4, 40, 400
I need the final output to look like this:
FileA-ID, FileB-ID, Match?, FileA-Items, FileB-items, Match?, FileA-Amount, FileB-Amount, Match?
A1, A1, Y, 10, 10, Y, 100, 100, Y
A2, A2, Y, 20, 12, N, 200, 120, N
A3, NAN, N, 30, NAN, N, 300, NAN, N
NAN, A4, N, NAN, 40, N, NAN, 400, N
This will be a monthly process, so I want the code to be generic enough that I can rerun it on a new file each month.
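First, keep track of the original columns in a list (df1 and df2 are the data frames holding File A and File B).
import numpy as np
import pandas as pd

cols = df1.columns.tolist()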
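The answer assumes df1 and df2 already exist. A minimal sketch of one way to load them, assuming the files are comma-separated text laid out like the samples in the question; the in-memory buffers below are stand-ins for the real monthly files, and skipinitialspace=True is used so the blank after each comma does not end up in the column names:

```python
import io

import pandas as pd

# Stand-ins for the two monthly files (assumption: real code would pass
# file paths to read_csv instead of these in-memory buffers).
file_a = io.StringIO("ID, items, Amount\nA1, 10, 100\nA2, 20, 200\nA3, 30, 300\n")
file_b = io.StringIO("ID, items, Amount\nA1, 10, 100\nA2, 12, 120\nA4, 40, 400\n")

# skipinitialspace=True drops the blank that follows each comma, so the
# headers come out as 'items' and 'Amount' rather than ' items' and ' Amount'.
df1 = pd.read_csv(file_a, skipinitialspace=True)
df2 = pd.read_csv(file_b, skipinitialspace=True)

cols = df1.columns.tolist()
print(cols)
```

Without that option the later lookups by column name would have to include the leading space.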
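Then set the ID column as the index in both data frames, copy it back as a regular column so that each file keeps its own ID, and rename the column headers with a FileA- or FileB- prefix. Concatenate the two data frames along the columns.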
df1 = df1.set_index('ID')
df2 = df2.set_index('ID')
df1['ID'] = df1.index
df2['ID'] = df2.index
df1 = df1.rename(lambda col: f'FileA-{col}', axis=1)
df2 = df2.rename(lambda col: f'FileB-{col}', axis=1)
df_ = pd.concat([df1, df2], axis=1)
print(df_)
    FileA-items  FileA-Amount  FileA-ID  FileB-items  FileB-Amount  FileB-ID
ID
A1         10.0         100.0        A1         10.0         100.0        A1
A2         20.0         200.0        A2         12.0         120.0        A2
A3         30.0         300.0        A3          NaN           NaN       NaN
A4          NaN           NaN       NaN         40.0         400.0        A4
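As a cross-check on the concatenation above, it can help to see which file each ID appears in. The sketch below uses DataFrame.merge with indicator=True, which is a different technique from the one in this answer; the _A and _B suffixes are arbitrary choices made for this example:

```python
import pandas as pd

# Sample data copied from File A and File B in the question.
df1 = pd.DataFrame({"ID": ["A1", "A2", "A3"], "items": [10, 20, 30], "Amount": [100, 200, 300]})
df2 = pd.DataFrame({"ID": ["A1", "A2", "A4"], "items": [10, 12, 40], "Amount": [100, 120, 400]})

# An outer merge keeps the IDs from both files; indicator=True adds a
# '_merge' column holding 'both', 'left_only' or 'right_only'.
merged = df1.merge(df2, on="ID", how="outer", suffixes=("_A", "_B"), indicator=True)

presence = merged.set_index("ID")["_merge"].astype(str).to_dict()
print(presence)
```

Here left_only means the ID exists only in File A and right_only means it exists only in File B.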
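Finally, create a Match column for each original column by comparing its FileA and FileB versions. Then sort the column headers so that each FileA/FileB/Match group follows the original column order.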
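for col in cols:
    # Flag 'Y' where both files agree on this column, otherwise 'N'
    df_[f'Match-{col}'] = np.where(df_[f'FileA-{col}'] == df_[f'FileB-{col}'], 'Y', 'N')
# Order by the original column that follows the first '-' in each header
df_ = df_.reindex(sorted(df_.columns, key=lambda x: cols.index(x.split('-', 1)[1])), axis=1)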
print(df_)
    FileA-ID  FileB-ID  Match-ID  FileA-items  FileB-items  Match-items  FileA-Amount  FileB-Amount  Match-Amount
ID
A1        A1        A1         Y         10.0         10.0            Y         100.0         100.0             Y
A2        A2        A2         Y         20.0         12.0            N         200.0         120.0             N
A3        A3       NaN         N         30.0          NaN            N         300.0           NaN             N
A4       NaN        A4         N          NaN         40.0            N           NaN         400.0             N
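The N flags on the A3 and A4 rows come from the fact that NaN never compares equal to anything, itself included, so a value missing from either file is always reported as a mismatch. A small check of that behaviour, using made-up series rather than the data frames above:

```python
import numpy as np
import pandas as pd

# Made-up values: one equal pair, one unequal pair, one pair missing on both sides.
a = pd.Series([10.0, 20.0, np.nan])
b = pd.Series([10.0, 12.0, np.nan])

# Element-wise == is False wherever either side is NaN.
flags = np.where(a == b, "Y", "N").tolist()
print(flags)  # ['Y', 'N', 'N']
```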
print(df_.reset_index(drop=True))
   FileA-ID  FileB-ID  Match-ID  FileA-items  FileB-items  Match-items  FileA-Amount  FileB-Amount  Match-Amount
0        A1        A1         Y         10.0         10.0            Y         100.0         100.0             Y
1        A2        A2         Y         20.0         12.0            N         200.0         120.0             N
2        A3       NaN         N         30.0          NaN            N         300.0           NaN             N
3       NaN        A4         N          NaN         40.0            N           NaN         400.0             N
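Since the question asks for something that can be rerun every month, the steps above could be wrapped in one function. This is only a sketch: compare_frames and its parameters are names invented for this example, and it assumes both frames have the same columns and that the key column has no duplicate values within a file.

```python
import numpy as np
import pandas as pd


def compare_frames(df_a, df_b, key="ID", label_a="FileA", label_b="FileB"):
    """Outer-align two frames on `key` and flag each column pair with Y/N."""
    cols = df_a.columns.tolist()
    parts = []
    for df, label in ((df_a, label_a), (df_b, label_b)):
        # drop=False keeps the key as a regular column as well as the index
        part = df.set_index(key, drop=False)
        parts.append(part.add_prefix(f"{label}-"))
    out = pd.concat(parts, axis=1)

    ordered = []
    for col in cols:
        a, b = f"{label_a}-{col}", f"{label_b}-{col}"
        # NaN never equals NaN, so rows missing from one side get 'N'
        out[f"Match-{col}"] = np.where(out[a] == out[b], "Y", "N")
        ordered += [a, b, f"Match-{col}"]
    return out[ordered].reset_index(drop=True)


# Sample data copied from File A and File B in the question.
df1 = pd.DataFrame({"ID": ["A1", "A2", "A3"], "items": [10, 20, 30], "Amount": [100, 200, 300]})
df2 = pd.DataFrame({"ID": ["A1", "A2", "A4"], "items": [10, 12, 40], "Amount": [100, 120, 400]})

result = compare_frames(df1, df2)
print(result)
```

The result can then be saved with something like result.to_csv('comparison.csv', index=False, na_rep='NAN'), where the file name is a placeholder and na_rep writes missing values as NAN, as in the requested output.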