Python pandas dataframe, check if column value matches other column value for every row

Imagine a DataFrame with the columns Username and UserId.

Username UserId
user1    1
user1    1
user2    2
user3    1    <- this is wrong, for example
user1    1

user3 has the same UserId as user1, which is not supposed to be possible. How can I check whether any such occurrences exist?

First remove all rows that are duplicated across both columns:

# keep one row per distinct (Username, UserId) pair
df1 = df.drop_duplicates(['Username','UserId'])
print(df1)
  Username  UserId
0    user1       1
2    user2       2
3    user3       1

Then get all rows whose UserId is duplicated. More logic is still needed here to distinguish whether the wrong value belongs to user1 or to user3:

# keep=False marks every row of a repeated UserId, not only the later ones
dups = df1[df1['UserId'].duplicated(keep=False)]
print(dups)
  Username  UserId
0    user1       1
3    user3       1
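
As a side note, the yes/no question "does any such occurrence exist?" can also be answered without dropping duplicates first. The following is a minimal sketch of an alternative that is not part of the answer above: it counts the distinct usernames per UserId with groupby and transform('nunique'). The variable names names_per_id, has_conflict and conflicts are chosen here for illustration only.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "Username": ["user1", "user1", "user2", "user3", "user1"],
    "UserId": [1, 1, 2, 1, 1],
})

# Number of distinct usernames sharing each UserId, broadcast back to every row.
names_per_id = df.groupby("UserId")["Username"].transform("nunique")

# True if at least one UserId is shared by more than one username.
has_conflict = bool((names_per_id > 1).any())

# All rows whose UserId is shared by more than one username.
conflicts = df[names_per_id > 1]

print(has_conflict)
print(conflicts)
```

Like the dups frame above, this only reports which rows are involved in a conflict; it does not decide which username is the wrong one.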

Sample data, with an extra user4 added to make the example more informative:

print(df)
  Username  UserId
0    user1       1
1    user1       1
2    user2       2
3    user3       1
4    user4       1
5    user4       1
6    user1       1
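
For anyone who wants to follow along, the sample frame printed above could be built like this (a sketch; the answer does not show the construction, so the dict-based constructor here is just one convenient way to do it):

```python
import pandas as pd

# Rebuild the sample data shown above, including the extra user4 rows.
df = pd.DataFrame({
    "Username": ["user1", "user1", "user2", "user3", "user4", "user4", "user1"],
    "UserId": [1, 1, 2, 1, 1, 1, 1],
})
print(df)
```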

One idea is to count the rows in each group defined by both columns, using GroupBy.transform:

# number of rows for each (Username, UserId) pair, repeated on every row of that pair
df['count'] = df.groupby(['Username','UserId'])['UserId'].transform('size')
print(df)
  Username  UserId  count
0    user1       1      3
1    user1       1      3
2    user2       2      1
3    user3       1      1
4    user4       1      2
5    user4       1      2
6    user1       1      3

Then remove duplicates and sort with DataFrame.sort_values:

# one row per pair, ordered so the most frequent pair comes last within each UserId
df1 = df.drop_duplicates(['Username','UserId']).sort_values(['UserId','count'])
print(df1)
  Username  UserId  count
3    user3       1      1
4    user4       1      2
0    user1       1      3
2    user2       2      1

Get all dupes:

mask1 = df1['UserId'].duplicated(keep=False)
dups = df1[mask1]
print(dups)
  Username  UserId  count
3    user3       1      1
4    user4       1      2
0    user1       1      3

Remove the dupe with the maximal count by combining the mask with Series.duplicated using keep='last':

# keep='last' leaves the last (highest-count) row of each UserId unmarked
dups_without_max_count = df1[df1['UserId'].duplicated(keep='last') & mask1]
print(dups_without_max_count)
  Username  UserId  count
3    user3       1      1
4    user4       1      2
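
The steps above can be collected into one function. This is only a sketch under a few assumptions: the function name find_minority_usernames is hypothetical, the input is copied so that the caller's frame does not gain a count column, and the most frequent username for a UserId is treated as the correct one, which is the heuristic implied by the answer rather than a guarantee.

```python
import pandas as pd


def find_minority_usernames(df):
    """Return one row per (Username, UserId) pair whose UserId is also used
    by a more frequent username. Hypothetical helper for illustration."""
    out = df.copy()  # avoid adding the 'count' column to the caller's frame

    # Rows per (Username, UserId) pair, broadcast to every row of the pair.
    out["count"] = out.groupby(["Username", "UserId"])["UserId"].transform("size")

    # One row per pair; within each UserId the highest count is sorted last.
    pairs = out.drop_duplicates(["Username", "UserId"]).sort_values(["UserId", "count"])

    # Pairs whose UserId appears for more than one username.
    shared = pairs["UserId"].duplicated(keep=False)

    # Everything except the last (highest-count) pair of each UserId.
    not_last = pairs["UserId"].duplicated(keep="last")

    return pairs[shared & not_last]


df = pd.DataFrame({
    "Username": ["user1", "user1", "user2", "user3", "user4", "user4", "user1"],
    "UserId": [1, 1, 2, 1, 1, 1, 1],
})
result = find_minority_usernames(df)
print(result)
```

Two caveats: if two usernames share a UserId with equal counts, which of them ends up last after sorting is not something this sketch controls, so ties would need a separate rule. Also, a row marked by duplicated(keep='last') always belongs to a shared UserId, so the keep=False mask is kept here only to mirror the steps above.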
