
How can I eliminate a duplicated row of a form submission in Python with Pandas?

I've got a dataset of form submissions, and some of the forms have been submitted multiple times.

In those cases it's the same person with the same selections in the form, but with slightly different submission_ids and submission dates.

I want to remove one of the submissions (I'll say the 2nd one, but it shouldn't matter because they are identical). If I do:

lit_subset[lit_subset.duplicated()]

I don't get what I want, because the submission_ids are unique, so no row counts as a duplicate. If I subset the columns (dropping submission_id and submission_date), I can see which records are duplicated, but I don't know how to grab one of the submission_ids and remove it from the original dataset. This is an easy thing for me to do in SQL Server:

select first_name
    ,last_name
    ,email
    ,telephone
    ,accountNumber
    ,refund_option
    ,max(submission_id) as 'max_submission'
from #refund_form_data
group by first_name
    ,last_name
    ,email
    ,telephone
    ,accountNumber
    ,refund_option
having count(*) > 1

Here's a sample dataset:

import pandas as pd

data = {'submission_id':  ['abc456', 'abc123','def456','ghi789'],
        'first_name': ['Mark', 'Mark','Andrew','Allie'],
        'last_name': ['Baseball', 'Baseball','football','hockey'],
        'choice': ['Athletics', 'Athletics','Falcons','Canucks'],
        }

df = pd.DataFrame(data, columns=['submission_id', 'first_name', 'last_name', 'choice'])

print(df)

I'd like an output that looks like this:

  submission_id first_name last_name     choice
0        abc123       Mark  Baseball  Athletics
1        def456     Andrew  football    Falcons
2        ghi789      Allie    hockey    Canucks

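In your example, you can take the first submission_id of each group and merge it back onto the original frame, which keeps the first row of each group: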

_df = df.groupby(['first_name','last_name','choice'],as_index=False)['submission_id'].head(1)

df = df.merge(_df,how='inner')
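
If the goal is to see which submission_ids would be removed before removing them, a minimal sketch along the following lines may help. It assumes that two rows count as the same submission when first_name, last_name and choice all match; the names key_cols, is_repeat, ids_to_drop and deduped are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'submission_id': ['abc456', 'abc123', 'def456', 'ghi789'],
    'first_name': ['Mark', 'Mark', 'Andrew', 'Allie'],
    'last_name': ['Baseball', 'Baseball', 'football', 'hockey'],
    'choice': ['Athletics', 'Athletics', 'Falcons', 'Canucks'],
})

# Assumption: these columns define what "the same submission" means.
key_cols = ['first_name', 'last_name', 'choice']

# True for every row that repeats an earlier row on the key columns.
is_repeat = df.duplicated(subset=key_cols, keep='first')

# The submission_ids that would be removed.
ids_to_drop = df.loc[is_repeat, 'submission_id'].tolist()

# Remove those ids from the original frame.
deduped = df[~df['submission_id'].isin(ids_to_drop)].reset_index(drop=True)

print(ids_to_drop)
print(deduped)
```

Passing keep='last' to duplicated would flag the earlier row of each group instead of the later one.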

Or, to keep the maximum submission_id of each group, as in your SQL query:

df = df.groupby(['first_name','last_name','choice'],as_index=False)['submission_id'].max()
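
The groupby/max result only contains the grouping columns and submission_id, so a column such as submission_date would be lost. One possible way around that, sketched here, is to merge the per-group maximum back onto the original frame. The submission_date values below are made up for illustration, and key_cols, max_ids and latest are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({
    'submission_id': ['abc456', 'abc123', 'def456', 'ghi789'],
    # Hypothetical dates, only here to show that extra columns survive.
    'submission_date': ['2020-05-02', '2020-05-01', '2020-05-03', '2020-05-04'],
    'first_name': ['Mark', 'Mark', 'Andrew', 'Allie'],
    'last_name': ['Baseball', 'Baseball', 'football', 'hockey'],
    'choice': ['Athletics', 'Athletics', 'Falcons', 'Canucks'],
})

# Assumption: these columns define what "the same submission" means.
key_cols = ['first_name', 'last_name', 'choice']

# Largest submission_id per group (string comparison, like MAX in the SQL).
max_ids = df.groupby(key_cols, as_index=False)['submission_id'].max()

# Keep only the original rows whose submission_id is the group maximum.
latest = df.merge(max_ids, on=key_cols + ['submission_id'], how='inner')

print(latest)
```

Because the inner merge matches on the key columns plus submission_id, each group contributes exactly one row as long as submission_id is unique.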

With the drop_duplicates method, the subset argument selects which columns are compared; by default the first row of each duplicate group is kept:

df.drop_duplicates(subset=['first_name', 'last_name', 'choice'], inplace=True) 
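
Which of the duplicated rows survives can be controlled with the keep argument. As a sketch on the sample data from the question, keep='last' retains the second Mark row (abc123), which matches the output the question asks for; sorting by submission_id first is an alternative if the smallest id should always win. The names key_cols, deduped and smallest_id are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'submission_id': ['abc456', 'abc123', 'def456', 'ghi789'],
    'first_name': ['Mark', 'Mark', 'Andrew', 'Allie'],
    'last_name': ['Baseball', 'Baseball', 'football', 'hockey'],
    'choice': ['Athletics', 'Athletics', 'Falcons', 'Canucks'],
})

# Assumption: these columns define what "the same submission" means.
key_cols = ['first_name', 'last_name', 'choice']

# Keep the last row of each duplicate group and renumber the index.
deduped = df.drop_duplicates(subset=key_cols, keep='last').reset_index(drop=True)

# Alternative: sort first so that "first" means "smallest submission_id".
smallest_id = (
    df.sort_values('submission_id')
      .drop_duplicates(subset=key_cols, keep='first')
      .reset_index(drop=True)
)

print(deduped)
```

Both variants return a new DataFrame rather than modifying df in place.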
