I have a dataframe thousands of rows long that looks like this:
ID Email Address
1 ... ...
2 ... ...
3 ... ...
4 ... ...
1 ... ...
2 ... ...
5 ... ...
5 ... ...
6 ... ...
What I want to do is drop duplicates of ID so there is only one row per ID. I can't use drop_duplicates() because most people don't have IDs, and it drops those rows too (not good!)
Is there a way to remove specific rows so that only one instance of each ID is kept?
I have a dataframe of all the duplicate IDs I want to remove, if that helps. E.g. for the example I gave above:
ID Email Address
1 ... ...
2 ... ...
5 ... ...
Maybe there's a way to turn this into a series/array of IDs and remove them from the df that way?
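The series/array idea can work. A minimal sketch using toy data mirroring the example above (the column names and values are just placeholders): turn the helper frame into an array of IDs, then drop only the repeat occurrences of those IDs, so the first instance of each ID survives.

```python
import pandas as pd

# Toy frame mirroring the example above.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 1, 2, 5, 5, 6],
    "Email": list("abcdefghi"),
})
# Helper frame of the IDs that appear more than once (as in the question).
dupes = pd.DataFrame({"ID": [1, 2, 5]})

# Turn the helper frame into an array of IDs ...
dup_ids = dupes["ID"].to_numpy()

# ... and remove a row only if its ID is in that array AND an earlier row
# already carried the same ID (duplicated() flags second-and-later hits).
cleaned = df[~(df["ID"].isin(dup_ids) & df.duplicated(subset="ID"))]
```

This keeps the first row for IDs 1, 2 and 5 and leaves every other row untouched.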
I believe you need to chain 2 conditions: duplicated
with keep=False
for all dupes, and with no parameter for all but the first occurrence. The selected rows are the duplicates to remove:
df = df[df.duplicated(subset='ID', keep=False) & df.duplicated(subset='ID')]
print(df)
ID Email Address
4 1 ... ...
5 2 ... ...
7 5 ... ...
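If you want the cleaned frame directly rather than the rows to remove, you can invert the mask. A sketch, assuming people without an ID show up as NaN (pandas also flags repeated NaNs as duplicates of each other, so notna() keeps those rows out of the drop mask):

```python
import pandas as pd

# Toy data: two rows with no ID at the end.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 1, 2, 5, 5, 6, None, None],
    "Email": [f"e{i}" for i in range(11)],
})

# duplicated() with the default keep='first' flags second-and-later
# occurrences of each ID; notna() excludes the no-ID rows from the mask.
mask = df["ID"].notna() & df.duplicated(subset="ID")
cleaned = df[~mask]
```

Every NaN-ID row survives, and each real ID keeps exactly one row.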
Is this what you want?
df[df.duplicated(subset='ID')]
ID Email Address
4 1 ... ...
5 2 ... ...
7 5 ... ...
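To actually remove those rows rather than just display them, one option (a sketch on toy data matching the example) is to drop them by index:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 1, 2, 5, 5, 6],
    "Email": list("abcdefghi"),
})

# The selection above is the set of rows to remove; drop them by index.
to_remove = df[df.duplicated(subset="ID")]
cleaned = df.drop(to_remove.index)
```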