How to merge two dfs which have duplicates in both

Question

I have two DataFrames, df1 and df2, and both contain duplicate rows. I want to merge them. So far I have tried removing the duplicates from only one of them, df2, because I need all the rows from df1.

This question might be a duplicate, but I didn't find any solution or hints for this particular scenario.

import pandas as pd

data = {'Name':['ABC', 'DEF', 'ABC','MNO', 'XYZ','XYZ','PQR','ABC'],
        'Age':[1,2,3,4,2,1,2,4]}
data2 = {'Name':['XYZ', 'NOP', 'ABC','MNO', 'XYZ','XYZ','PQR','ABC'],
        'Sex':['M','F','M','M','M','M','F','M']}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)

dfn = df1.merge(df2.drop_duplicates('Name'),on='Name')
print(dfn)

Result of the above snippet:

  Name  Age Sex
0  ABC    1   M
1  ABC    3   M
2  ABC    4   M
3  MNO    4   M
4  XYZ    2   M
5  XYZ    1   M
6  PQR    2   F

This works perfectly well for the above data, but my real data is large, and there this method behaves differently: I get many more rows than expected in dfn.

I suspect the extra rows come from the larger data containing more duplicates, but I cannot afford to delete the duplicate rows from df1.
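One way to check that suspicion is to count how often each key occurs on each side before merging. The sketch below reuses the sample data from above; `left_counts`, `right_counts` and `expected_rows` are names introduced here for illustration only. It relies on the fact that an inner merge produces, for every shared key, the number of left rows times the number of right rows.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['ABC', 'DEF', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
                    'Age': [1, 2, 3, 4, 2, 1, 2, 4]})
df2 = pd.DataFrame({'Name': ['XYZ', 'NOP', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
                    'Sex': ['M', 'F', 'M', 'M', 'M', 'M', 'F', 'M']})

# Occurrences of each key on the left, and on the deduplicated right.
left_counts = df1['Name'].value_counts()
right_counts = df2.drop_duplicates('Name')['Name'].value_counts()

# Keys present on only one side become NaN in the product and are dropped.
expected_rows = int((left_counts * right_counts).dropna().sum())

dfn = df1.merge(df2.drop_duplicates('Name'), on='Name')
print(expected_rows, len(dfn))
```

If the two numbers match on the real data as well, the row count is exactly what the duplicates in df1 imply, and the "extra" rows are df1's own repeated keys rather than something introduced by df2.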

Apologies, I am not able to share the actual data because it is too large!

Edit: A sample result from the actual data: df2 after removing duplicates, and the resulting dfn. I have only one entry in df1 for both ABC and XYZ:

Thanks in advance!

Answer 1

Try calling drop_duplicates on df1 too (note that this keeps only the first row for each Name in both frames):

dfn = pd.merge(df1.drop_duplicates('Name'),
               df2.drop_duplicates('Name'),
               on='Name')
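
If dropping rows from df1 is not acceptable, as the question states, a possible alternative is to deduplicate only df2 and let pandas verify that the right-hand keys are unique. This is a minimal sketch using the question's sample data, not part of the original answer. The `right` variable is a name introduced here, and `how='left'` is an assumption that unmatched df1 rows should be kept (with NaN in Sex).

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['ABC', 'DEF', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
                    'Age': [1, 2, 3, 4, 2, 1, 2, 4]})
df2 = pd.DataFrame({'Name': ['XYZ', 'NOP', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
                    'Sex': ['M', 'F', 'M', 'M', 'M', 'M', 'F', 'M']})

# Keep the first row per Name on the right side only.
right = df2.drop_duplicates('Name')

# validate='many_to_one' raises pandas.errors.MergeError if 'Name'
# is not unique in `right`, so a silent row blow-up cannot happen.
dfn = df1.merge(right, on='Name', how='left', validate='many_to_one')

# A left merge against unique keys returns exactly one row per df1 row.
assert len(dfn) == len(df1)
print(dfn)
```

With unique keys on the right, the result cannot have more rows than df1, so any surplus in the real data would point to the merge being run against a frame that was not actually deduplicated on the merge key.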
