I've got a dataset of form submissions, and some of the forms have been submitted multiple times: the same person and the same selections in the form, but slightly different submission_ids and submission dates.
I want to remove one of the duplicate submissions (I'll say the 2nd one, but it shouldn't matter because they are identical). If I do:
lit_subset[lit_subset.duplicated()]
I either don't get what I want (because the submission_ids are unique), or, if I subset the columns (dropping submission_id and submission_date), I can see which records are duplicated, but I don't know how to grab one of the submission_ids and remove it from the original dataset. This is an easy thing for me to do in SQL Server:
select first_name
,last_name
,email
,telephone
,accountNumber
,refund_option
,max(submission_id) as max_submission
from #refund_form_data
group by first_name
,last_name
,email
,telephone
,accountNumber
,refund_option
having count(*) > 1
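For reference, the SQL above can be mirrored in pandas with a groupby plus a filter on the group size. This is a sketch on a hypothetical frame standing in for #refund_form_data (column names taken from the query; the data values are made up):

```python
import pandas as pd

# Hypothetical stand-in for #refund_form_data
df = pd.DataFrame({
    'first_name': ['Mark', 'Mark', 'Andrew'],
    'last_name': ['Baseball', 'Baseball', 'football'],
    'email': ['m@x.com', 'm@x.com', 'a@x.com'],
    'telephone': ['555-0100', '555-0100', '555-0101'],
    'accountNumber': ['1001', '1001', '1002'],
    'refund_option': ['check', 'check', 'wire'],
    'submission_id': ['abc456', 'abc123', 'def456'],
})

key_cols = ['first_name', 'last_name', 'email', 'telephone',
            'accountNumber', 'refund_option']

# GROUP BY key_cols, take MAX(submission_id) and COUNT(*),
# then keep only groups with more than one row (HAVING COUNT(*) > 1)
dupes = (df.groupby(key_cols)['submission_id']
           .agg(max_submission='max', n='count')
           .query('n > 1')
           .reset_index())
print(dupes[key_cols + ['max_submission']])
```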
Here's a sample dataset:
import pandas as pd
data = {'submission_id': ['abc456', 'abc123', 'def456', 'ghi789'],
        'first_name': ['Mark', 'Mark', 'Andrew', 'Allie'],
        'last_name': ['Baseball', 'Baseball', 'football', 'hockey'],
        'choice': ['Athletics', 'Athletics', 'Falcons', 'Canucks'],
        }
df = pd.DataFrame(data, columns=['submission_id', 'first_name', 'last_name', 'choice'])
print(df)
I'd like an output that looks like this:
submission_id first_name last_name choice
0 abc123 Mark Baseball Athletics
1 def456 Andrew football Falcons
2 ghi789 Allie hockey Canucks
In your example, you can do something like:
_df = df.groupby(['first_name','last_name','choice'],as_index=False)['submission_id'].head(1)
df = df.merge(_df,how='inner')
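Run on the sample frame, this keeps the first-seen submission_id per group; note that for Mark that is abc456 (the first row), not abc123, so use keep='last' semantics elsewhere if you specifically want the other one:

```python
import pandas as pd

df = pd.DataFrame({
    'submission_id': ['abc456', 'abc123', 'def456', 'ghi789'],
    'first_name': ['Mark', 'Mark', 'Andrew', 'Allie'],
    'last_name': ['Baseball', 'Baseball', 'football', 'hockey'],
    'choice': ['Athletics', 'Athletics', 'Falcons', 'Canucks'],
})

# head(1) takes the first row of each group, so the first-seen
# submission_id per person survives (abc456 for Mark)
_df = df.groupby(['first_name', 'last_name', 'choice'],
                 as_index=False)['submission_id'].head(1)

# inner merge on submission_id keeps only those surviving rows
result = df.merge(_df.to_frame(), how='inner')
print(result)
```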
Or do this if you want the max:
df = df.groupby(['first_name','last_name','choice'],as_index=False)['submission_id'].max()
Using the drop_duplicates method, you can choose which columns to consider with the subset argument:
df.drop_duplicates(subset=['first_name', 'last_name', 'choice'], inplace=True)
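With the sample frame, the keep argument decides which duplicate survives: the default keep='first' retains abc456 for Mark, while keep='last' retains abc123, matching the desired output shown in the question:

```python
import pandas as pd

df = pd.DataFrame({
    'submission_id': ['abc456', 'abc123', 'def456', 'ghi789'],
    'first_name': ['Mark', 'Mark', 'Andrew', 'Allie'],
    'last_name': ['Baseball', 'Baseball', 'football', 'hockey'],
    'choice': ['Athletics', 'Athletics', 'Falcons', 'Canucks'],
})

# keep='last' retains abc123 for the Mark/Baseball/Athletics pair;
# the default keep='first' would retain abc456 instead
deduped = (df.drop_duplicates(subset=['first_name', 'last_name', 'choice'],
                              keep='last')
             .reset_index(drop=True))
print(deduped)
```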