
Only get duplicated values within groups with pandas

I have a data frame such as:

groups  ids numbers
group3  id4 89
group1  id1 50
group1  id1 30
group1  id2 90
group2  id4 89
group2  id6 76
group3  id4 90

and the idea is to find the ids that are duplicated within each group (grouping by groups) and get a new data frame containing only those rows, such as:

group1  id1 50
group1  id1 30
group3  id4 89
group3  id4 90

I tried:

for groups in df.groupby('groups'):
 print(df['ids'].duplicated)

Thanks for your help.
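For reference, the loop in the question prints the bound method `duplicated` (it is never called) and always looks at the whole `df` rather than at the current group. A minimal sketch of what the loop may have been aiming for, assuming the sample data shown in the question (the names `parts` and `result` are only illustrative):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "groups": ["group3", "group1", "group1", "group1", "group2", "group2", "group3"],
    "ids": ["id4", "id1", "id1", "id2", "id4", "id6", "id4"],
    "numbers": [89, 50, 30, 90, 89, 76, 90],
})

parts = []
# Iterating over a groupby yields (key, sub-frame) pairs.
for name, group in df.groupby("groups"):
    # Call duplicated() on the sub-frame; keep=False flags every repeated id.
    parts.append(group[group["ids"].duplicated(keep=False)])

# Stitch the per-group pieces back together (original index labels are kept).
result = pd.concat(parts)
print(result)
```

This works, but it loops in Python once per group, so it is mainly useful for understanding what was wrong with the original attempt.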

The groupby function is not necessary here. For better performance, use DataFrame.duplicated on multiple columns with the parameter keep=False to flag all duplicates, then filter by boolean indexing:

df = df[df.duplicated(['groups','ids'], keep=False)]
print (df)
   groups  ids  numbers
0  group3  id4       89
1  group1  id1       50
2  group1  id1       30
6  group3  id4       90
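To make the role of `keep=False` easier to see, here is a self-contained sketch that rebuilds the sample frame from the question and compares the two masks. The variable names `mask_all`, `mask_default` and `dupes` are illustrative only:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "groups": ["group3", "group1", "group1", "group1", "group2", "group2", "group3"],
    "ids": ["id4", "id1", "id1", "id2", "id4", "id6", "id4"],
    "numbers": [89, 50, 30, 90, 89, 76, 90],
})

# keep=False marks every row whose (groups, ids) pair occurs more than once.
mask_all = df.duplicated(["groups", "ids"], keep=False)

# The default keep="first" leaves the first occurrence unmarked,
# so filtering with it would drop one row from each duplicated pair.
mask_default = df.duplicated(["groups", "ids"])

dupes = df[mask_all]
print(dupes)
```

Note that id4 in group2 is not selected: it repeats across groups, but not within group2, because the comparison is made on the pair of columns.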

If sorting is necessary, add DataFrame.sort_values, followed by DataFrame.reset_index to get a default index:

df = (df[df.duplicated(['groups','ids'], keep=False)]
         .sort_values(['groups','ids'])
         .reset_index(drop=True))
print (df)
   groups  ids  numbers
0  group1  id1       50
1  group1  id1       30
2  group3  id4       89
3  group3  id4       90
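As a possible shorthand, assuming pandas 1.0 or later, `sort_values` accepts `ignore_index=True`, which renumbers the result from 0 and so can stand in for the separate `reset_index(drop=True)` call. A minimal sketch on the question's sample data:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "groups": ["group3", "group1", "group1", "group1", "group2", "group2", "group3"],
    "ids": ["id4", "id1", "id1", "id2", "id4", "id6", "id4"],
    "numbers": [89, 50, 30, 90, 89, 76, 90],
})

# Filter the duplicated (groups, ids) pairs, then sort and renumber in one call.
result = (df[df.duplicated(["groups", "ids"], keep=False)]
          .sort_values(["groups", "ids"], ignore_index=True))
print(result)
```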

You can use:

df.groupby('groups').apply(lambda x: \
            x[x.duplicated('ids',keep=False)]).reset_index(drop=True)

Output:

   groups  ids  numbers
0  group1  id1       50
1  group1  id1       30
2  group3  id4       89
3  group3  id4       90
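If a groupby-based solution is preferred, one alternative that avoids `apply` is to compute the size of each (groups, ids) combination with `transform` and keep the rows where that size exceeds 1. This is only a sketch on the question's sample data. It may also be more robust across versions, since recent pandas releases have changed how `groupby.apply` treats the grouping column:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "groups": ["group3", "group1", "group1", "group1", "group2", "group2", "group3"],
    "ids": ["id4", "id1", "id1", "id2", "id4", "id6", "id4"],
    "numbers": [89, 50, 30, 90, 89, 76, 90],
})

# For every row, the number of rows sharing the same (groups, ids) pair.
counts = df.groupby(["groups", "ids"])["ids"].transform("size")

# Keep only rows whose pair occurs more than once; the original index is preserved.
result = df[counts > 1]
print(result)
```

The result keeps the original row order and index labels; chain `sort_values` and `reset_index(drop=True)` if the sorted layout shown above is needed.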

