I have two dataframes, df1 and df2, which both contain duplicate rows. I want to merge them. What I have tried so far is removing the duplicates from df2, since I need all of the rows from df1.
This question may be a duplicate, but I didn't find any solution or hints for this particular scenario.
import pandas as pd

data = {'Name': ['ABC', 'DEF', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
        'Age': [1, 2, 3, 4, 2, 1, 2, 4]}
data2 = {'Name': ['XYZ', 'NOP', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
         'Sex': ['M', 'F', 'M', 'M', 'M', 'M', 'F', 'M']}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
dfn = df1.merge(df2.drop_duplicates('Name'), on='Name')
print(dfn)
Result of the above snippet:
Name Age Sex
0 ABC 1 M
1 ABC 3 M
2 ABC 4 M
3 MNO 4 M
4 XYZ 2 M
5 XYZ 1 M
6 PQR 2 F
This works perfectly well for the data above, but my real data is large, and there this method behaves differently: I get many more rows than expected in dfn.
I suspect the extra rows come from the larger amount of data and the greater number of duplicates, but I cannot afford to delete the duplicate rows from df1.
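To illustrate the suspicion about duplicates: when a join key appears m times in the left frame and n times in the right frame, `merge` produces m × n rows for that key (a many-to-many join). A minimal sketch with made-up data, not your actual frames:

```python
import pandas as pd

# 'ABC' appears twice on each side, so the merge emits 2 * 2 = 4 rows for it.
left = pd.DataFrame({'Name': ['ABC', 'ABC'], 'Age': [1, 3]})
right = pd.DataFrame({'Name': ['ABC', 'ABC'], 'Sex': ['M', 'M']})

merged = pd.merge(left, right, on='Name')
print(len(merged))  # -> 4, not 2
```

This is why row counts can explode on a large dataset even though the small sample looks fine: any key that remains duplicated on both sides multiplies out.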
Apologies, I am not able to share the actual data as it is too large! Edit: a sample result from the actual data: df2 after removing duplicates and the resulting dfn; I have only one entry in df1 for both ABC and XYZ:
Thanks in advance!
Try to drop_duplicates from df1 too:
dfn = pd.merge(df1.drop_duplicates('Name'),
               df2.drop_duplicates('Name'),
               on='Name')
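If you must keep every row of df1 (including its duplicates), an alternative sketch is to deduplicate only df2 and pass `validate='many_to_one'`, which makes pandas raise a `MergeError` instead of silently multiplying rows if the right-hand keys are not unique; `how='left'` also keeps df1 rows whose Name has no match in df2. This uses the sample data from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['ABC', 'DEF', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
                    'Age': [1, 2, 3, 4, 2, 1, 2, 4]})
df2 = pd.DataFrame({'Name': ['XYZ', 'NOP', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
                    'Sex': ['M', 'F', 'M', 'M', 'M', 'M', 'F', 'M']})

# validate='many_to_one' fails loudly if df2 still has duplicate Names,
# so the silent row explosion cannot happen unnoticed.
dfn = df1.merge(df2.drop_duplicates('Name'), on='Name',
                how='left', validate='many_to_one')
print(len(dfn))  # -> 8: one output row per df1 row ('DEF' gets NaN for Sex)
```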