
How to merge two DataFrames that both contain duplicates

I have two DataFrames, df1 and df2, and both contain duplicate rows. I want to merge them. What I have tried so far is removing the duplicates from df2 only, because I need all the rows from df1.

This question might be a duplicate, but I did not find any solution or hints for this particular scenario.

import pandas as pd

data = {'Name':['ABC', 'DEF', 'ABC','MNO', 'XYZ','XYZ','PQR','ABC'],
        'Age':[1,2,3,4,2,1,2,4]}
data2 = {'Name':['XYZ', 'NOP', 'ABC','MNO', 'XYZ','XYZ','PQR','ABC'],
        'Sex':['M','F','M','M','M','M','F','M']}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)

# Drop duplicate names from df2 only, then merge on Name (inner join by default)
dfn = df1.merge(df2.drop_duplicates('Name'),on='Name')
print(dfn)

Result of the above snippet:

  Name  Age Sex
0  ABC    1   M
1  ABC    3   M
2  ABC    4   M
3  MNO    4   M
4  XYZ    2   M
5  XYZ    1   M
6  PQR    2   F

This works perfectly well for the data above, but my real data is large, and there this method behaves differently: I am getting many more rows than expected in dfn.

I suspect the extra rows come from the larger data having more duplicates, but I cannot afford to delete the duplicate rows from df1.

Apologies, I am not able to share the actual data because it is too large!

Edit: A sample result from the actual data: df2 after removing duplicates, and the resulting dfn. I have only one entry in df1 for each of ABC and XYZ:

[Two screenshots, not preserved: df2 after removing duplicates, and the resulting dfn]

Thanks in advance!

Try dropping duplicates from df1 too:

dfn = pd.merge(df1.drop_duplicates('Name'),
               df2.drop_duplicates('Name'),
               on='Name')
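
If every row of df1 has to be kept, as the question states, a possible alternative is to deduplicate only df2 and let pandas check the key for you. The following is a minimal sketch using the sample data from the question; the choice of how='left' and validate='many_to_one' is my assumption about what is wanted, not something taken from the snippet above. With a right-hand frame that has one row per Name, the merge cannot return more rows than df1, so extra rows in the real data would suggest the key is still duplicated in the frame actually being merged.

```python
import pandas as pd

data = {'Name': ['ABC', 'DEF', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
        'Age': [1, 2, 3, 4, 2, 1, 2, 4]}
data2 = {'Name': ['XYZ', 'NOP', 'ABC', 'MNO', 'XYZ', 'XYZ', 'PQR', 'ABC'],
         'Sex': ['M', 'F', 'M', 'M', 'M', 'M', 'F', 'M']}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)

# Keep one row per Name on the right side only; df1 is left untouched.
right = df2.drop_duplicates('Name')

# how='left' keeps every row of df1 (names missing from df2 get NaN in Sex).
# validate='many_to_one' makes pandas raise pandas.errors.MergeError
# if 'Name' is not unique in the right frame.
dfn = df1.merge(right, on='Name', how='left', validate='many_to_one')

# Both lengths should match when the right-hand key is unique.
print(len(df1), len(dfn))
```

If the same call raises a MergeError on the real data, the right-hand frame still has repeated Name values, for example because duplicates were dropped on whole rows rather than on the key column.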
