
Merge Two DFs Without Dupes

I am trying to merge two data frames and eliminate dupes.

Here is DF#1:

import pandas as pd
data1 = {'id':['168'],'group_id':['360002136432'],'ticket_form_id':['360000159712']}
df1 = pd.DataFrame(data1)
print(df1)

Here is DF#2:

data2 = {'id':['362936613051','362936613051','362936613051'],'ticket_id':['168','168','168']}
df2 = pd.DataFrame(data2)
print(df2)

I am trying to merge, or consolidate, DF#1 and DF#2, so it looks like this.

id   group_id      ticket_form_id  ID
168  360002136432  360000159712    362936613051

It would be some kind of inner join (I think) between DF#1.id and DF#2.ticket_id, but I keep getting a bunch of dupes in the merged data frame. How can I eliminate the dupes in the merged data frame?

So, for ID = 8, I would expect to see 362563740691 and for ID = 10, I would expect to see 362563746711.


Instead, I'm seeing 362785076491 for ID = 8.


Your df2 does contain a lot of duplicate rows. If you don't need to keep the redundant data, you can drop the duplicates from df2 before merging:

df2.drop_duplicates(inplace=True)
print(df1.merge(df2, left_on='id', right_on='ticket_id'))

This removes the duplicate rows from the final dataframe, because each matching row in df2 otherwise produces its own row in the merged result.
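To see where the extra rows come from, here is a minimal sketch that rebuilds the two frames from the question and compares the row counts with and without de-duplicating df2 first. The names raw and deduped are only illustrative.

```python
import pandas as pd

# Frames rebuilt from the question.
df1 = pd.DataFrame({'id': ['168'],
                    'group_id': ['360002136432'],
                    'ticket_form_id': ['360000159712']})
df2 = pd.DataFrame({'id': ['362936613051'] * 3,
                    'ticket_id': ['168'] * 3})

# Every df2 row whose ticket_id matches df1.id yields one merged row,
# so three identical df2 rows give three merged rows.
raw = df1.merge(df2, left_on='id', right_on='ticket_id')

# Dropping exact duplicate rows from df2 first leaves a single match.
deduped = df1.merge(df2.drop_duplicates(), left_on='id', right_on='ticket_id')

print(len(raw), len(deduped))
```

Using df2.drop_duplicates() without inplace=True leaves the original df2 untouched, which may be preferable if it is needed again later.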

Another possibility is to remove duplicated rows after the merge.

df1 = df1.merge(df2, left_on='id', right_on='ticket_id', how='inner')
df1.drop_duplicates(inplace=True)
print(df1)
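Note that drop_duplicates() only removes rows that are identical in every column. If one ticket_id is paired with several different id values (which may be what is happening for ID = 8 in the question), more than one row will survive. The following is a sketch on made-up data, assuming that keeping the first id seen per ticket_id is acceptable. That rule is an assumption, not something stated in the question.

```python
import pandas as pd

# Hypothetical data: ticket '8' is paired with two different ids.
df1 = pd.DataFrame({'id': ['8'], 'group_id': ['g1']})
df2 = pd.DataFrame({
    'id': ['362563740691', '362785076491', '362563740691'],
    'ticket_id': ['8', '8', '8'],
})

# Full-row de-duplication still leaves two distinct (id, ticket_id) pairs.
pairs = df2.drop_duplicates()

# Keep only the first row for each ticket_id (assumed tie-breaking rule).
first_per_ticket = df2.drop_duplicates(subset='ticket_id', keep='first')

merged = df1.merge(first_per_ticket, left_on='id', right_on='ticket_id')
print(merged)
```

If a different row should win, sort df2 before calling drop_duplicates, or use keep='last'.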

Assuming every id / ticket_id pair in df2 is duplicated, as in the example:

df_new = df1.merge(df2[~df2.duplicated()], left_on='id', right_on='ticket_id').drop('ticket_id', axis=1)
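Because both frames have a column called id, the merge above labels them id_x and id_y. To get the layout asked for in the question (id, group_id, ticket_form_id, ID), one option is to rename df2's id column before merging. This is a sketch, and the intermediate name right is only illustrative.

```python
import pandas as pd

# Frames rebuilt from the question.
df1 = pd.DataFrame({'id': ['168'],
                    'group_id': ['360002136432'],
                    'ticket_form_id': ['360000159712']})
df2 = pd.DataFrame({'id': ['362936613051'] * 3,
                    'ticket_id': ['168'] * 3})

# De-duplicate df2 and rename its id column so it does not clash with df1.id.
right = df2.drop_duplicates().rename(columns={'id': 'ID'})

# Join on df1.id == df2.ticket_id, then drop the now-redundant key column.
df_new = (df1.merge(right, left_on='id', right_on='ticket_id')
             .drop(columns='ticket_id'))

print(df_new)
```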
