Imagine having the columns username and userid.
Username  UserId
user1     1
user1     1
user2     2
user3     1    <- this is wrong, for example
user1     1
User3 has the same UserId as user1, which is not supposed to be possible. How can I check whether any such occurrences exist?
First remove all duplicates by both columns:
df1 = df.drop_duplicates(['Username','UserId'])
print (df1)
  Username  UserId
0    user1       1
2    user2       2
3    user3       1
Then get all rows whose UserId is duplicated. Deciding whether the wrong value belongs to user1 or user3 still needs more logic:
dups = df1[df1['UserId'].duplicated(keep=False)]
print (dups)
  Username  UserId
0    user1       1
3    user3       1
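If the goal is only to check whether any such conflict exists, one possible sketch is to count the distinct usernames per UserId. This is an alternative to the drop_duplicates approach above, not part of the original answer, and the DataFrame construction is an assumption that mirrors the question's sample:

```python
import pandas as pd

# Assumed reconstruction of the question's sample data
df = pd.DataFrame({
    'Username': ['user1', 'user1', 'user2', 'user3', 'user1'],
    'UserId': [1, 1, 2, 1, 1],
})

# Number of distinct usernames that share each UserId
names_per_id = df.groupby('UserId')['Username'].nunique()

# UserIds that are used by more than one username
conflicting_ids = names_per_id[names_per_id > 1].index.tolist()

# True if at least one UserId is shared by different usernames
has_conflict = bool(conflicting_ids)

# One row per conflicting (Username, UserId) pair
conflicts = (df[df['UserId'].isin(conflicting_ids)]
             .drop_duplicates(['Username', 'UserId']))

print(has_conflict)
print(conflicts)
```

Like the approach above, this only reports which usernames collide; it does not say which of them is the wrong one.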
Sample data, with user4 added to make the example more informative:
print (df)
  Username  UserId
0    user1       1
1    user1       1
2    user2       2
3    user3       1
4    user4       1
5    user4       1
6    user1       1
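The answer does not show how this frame is built. To reproduce the steps, it could be constructed roughly like this (an assumption that matches the printed output):

```python
import pandas as pd

# Assumed construction of the extended sample shown above
df = pd.DataFrame({
    'Username': ['user1', 'user1', 'user2', 'user3', 'user4', 'user4', 'user1'],
    'UserId': [1, 1, 2, 1, 1, 1, 1],
})
print(df)
```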
One idea is to count the rows in each (Username, UserId) group with GroupBy.transform:
df['count'] = df.groupby(['Username','UserId'])['UserId'].transform('size')
print (df)
  Username  UserId  count
0    user1       1      3
1    user1       1      3
2    user2       2      1
3    user3       1      1
4    user4       1      2
5    user4       1      2
6    user1       1      3
Then remove duplicates and sort by UserId and count with DataFrame.sort_values:
df1 = df.drop_duplicates(['Username','UserId']).sort_values(['UserId','count'])
print (df1)
  Username  UserId  count
3    user3       1      1
4    user4       1      2
0    user1       1      3
2    user2       2      1
Get all dupes:
mask1 = df1['UserId'].duplicated(keep=False)
dups = df1[mask1]
print (dups)
  Username  UserId  count
3    user3       1      1
4    user4       1      2
0    user1       1      3
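Remove the row with the maximal count in each UserId group by chaining Series.duplicated with keep='last'. Because df1 is sorted by count, the last row per UserId is the one with the highest count: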
dups_without_max_count = df1[df1['UserId'].duplicated(keep='last') & mask1]
print (dups_without_max_count)
  Username  UserId  count
3    user3       1      1
4    user4       1      2
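The steps above can be collected into one helper. The function name find_minority_conflicts and its parameters are hypothetical, chosen for illustration, and the sample frame is an assumed reconstruction of the data used above. The sketch treats the most frequent username per UserId as the correct one, which is a heuristic rather than a guarantee:

```python
import pandas as pd

def find_minority_conflicts(df, name_col='Username', id_col='UserId'):
    """Return (name, id) pairs that share an id with a more frequent name."""
    out = df[[name_col, id_col]].copy()
    # Count rows per (name, id) pair
    out['count'] = out.groupby([name_col, id_col])[id_col].transform('size')
    # One row per pair, sorted so the highest count comes last within each id
    uniq = out.drop_duplicates([name_col, id_col]).sort_values([id_col, 'count'])
    # Ids that are used by more than one name
    shared = uniq[id_col].duplicated(keep=False)
    # Every row except the last (highest count) one per id
    not_top = uniq[id_col].duplicated(keep='last')
    return uniq[shared & not_top]

# Assumed reconstruction of the sample data
df = pd.DataFrame({
    'Username': ['user1', 'user1', 'user2', 'user3', 'user4', 'user4', 'user1'],
    'UserId': [1, 1, 2, 1, 1, 1, 1],
})

result = find_minority_conflicts(df)
print(result)
```

If two usernames tie for the highest count on the same UserId, the sort order decides which one is kept as the "correct" one, so ties may need separate handling.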