I am looking to develop some generic logic that will allow me to perform reconciliation between 2 datasets.
I have 2 dataframes and I want to loop through every row in df1 and check if it exists in df2. If it does exist, I want to create a new column 'Match' in df1 with the value 'Yes'; if it does not, I want to append the missing rows to a separate dataframe, which I will write to CSV.
Example datasets:
df1:
ID Name Age
1 Adam 45
2 Bill 44
3 Claire 23
df2:
ID Name Age
1 Adam 45
2 Bill 44
3 Claire 23
4 Bob 40
5 Chris 21
The column names in the 2 dataframes I've used here are just for reference. But essentially I want to check if the row (1, Adam, 45) in df1 exists in df2.
The output for df3 would look like this: df3:
ID Name Age
4 Bob 40
5 Chris 21
The updated df1 would look like this: df1:
ID Name Age Match
1 Adam 45 Yes
2 Bill 44 Yes
3 Claire 23 Yes
To be clear, I understand that this can be done using a merge or isin, but would like a fluid solution that can be used for any dataset.
I appreciate this might be a big ask as I haven't provided much guidance, but any help with this would be great!!
Thanks!!
You need to use merge here and utilize the indicator=True feature:
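# Assumes df1 and df2 are the dataframes from the question.
# Outer-merge on ID; the added _merge column records where each ID was found.
df_all = df1.merge(df2, on=['ID'], how='outer', indicator=True)
# Rows present only in df2 (missing from df1), keeping df2's Name and Age.
df3 = df_all[df_all['_merge'] == 'right_only'].drop(columns=['Name_x', 'Age_x']).rename(columns={'Name_y': 'Name', 'Age_y': 'Age'})[['ID', 'Name', 'Age']]
# Rows present in both dataframes (note that this reassigns df2).
df2 = df_all[df_all['_merge'] == 'both'].drop(columns=['Name_x', 'Age_x']).rename(columns={'Name_y': 'Name', 'Age_y': 'Age'})[['ID', 'Name', 'Age']]
print(df3)
print(df2)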
df3:
ID Name Age
3 4 Bob 40
4 5 Chris 21
df2:
ID Name Age
0 1 Adam 45
1 2 Bill 44
2 3 Claire 23
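Since the question asks for something that works on any dataset, here is a minimal sketch of the same indicator=True idea wrapped in a function that compares on every column the two dataframes share, so no column names are hard-coded. The function name reconcile, its on parameter, and the 'No' label for unmatched rows are my own assumptions, not part of the question or of pandas. It also assumes the shared columns have compatible dtypes in both dataframes.

```python
import pandas as pd


def reconcile(left, right, on=None):
    """Compare two DataFrames row by row on the columns in `on`.

    Returns (left_flagged, missing): `left_flagged` is `left` plus a 'Match'
    column, and `missing` holds the rows of `right` that are absent from `left`.
    (Hypothetical helper, written for illustration only.)
    """
    # Default to every column the two frames share, so nothing is hard-coded.
    if on is None:
        on = [col for col in left.columns if col in right.columns]

    # Flag each row of `left` by looking it up among the distinct rows of `right`.
    # De-duplicating the lookup side keeps the merge result aligned with `left`.
    right_keys = right[on].drop_duplicates()
    flagged = left.merge(right_keys, on=on, how="left", indicator=True)
    left_flagged = left.copy()
    left_flagged["Match"] = (
        (flagged["_merge"] == "both").map({True: "Yes", False: "No"}).to_numpy()
    )

    # Rows of `right` whose value combination never appears in `left`.
    left_keys = left[on].drop_duplicates()
    checked = right.merge(left_keys, on=on, how="left", indicator=True)
    not_in_left = (checked["_merge"] == "left_only").to_numpy()
    missing = right[not_in_left].reset_index(drop=True)
    return left_flagged, missing


# Example data copied from the question.
df1 = pd.DataFrame({"ID": [1, 2, 3],
                    "Name": ["Adam", "Bill", "Claire"],
                    "Age": [45, 44, 23]})
df2 = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                    "Name": ["Adam", "Bill", "Claire", "Bob", "Chris"],
                    "Age": [45, 44, 23, 40, 21]})

df1_flagged, df3 = reconcile(df1, df2)
print(df1_flagged)
print(df3)
# To write the missing rows to CSV (the file name is a placeholder):
# df3.to_csv("missing_rows.csv", index=False)
```

Because the merge key is the full list of shared columns, a row only counts as a match when every value agrees, which is the (1, Adam, 45) check described in the question. Pass on=['ID'] instead if matching on a key column alone is enough.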