Finding duplicated values across groups in Pandas GroupBy
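The idea is to convert the MultiIndex into a 3-column DataFrame with reset_index, reshape it with DataFrame.pivot, and drop the rows that are not duplicated across groups with DataFrame.dropna. The common values are then left in the index: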
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': [3, 4, 5, 8, 10, 12, 14, 12]})
df = df.groupby(['A', 'B']).sum()
# keyword arguments: newer pandas versions no longer accept them positionally
common = df.reset_index().pivot(index='C', columns='A', values='B').dropna().index
print(common)
Int64Index([12], dtype='int64', name='C')
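To see why only 12 survives, it can help to look at the pivoted table before dropna is applied. The following is a minimal sketch on the same data; the names `sums` and `wide` are introduced here for illustration only and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': [3, 4, 5, 8, 10, 12, 14, 12]})
sums = df.groupby(['A', 'B']).sum()

# One row per summed value of C, one column per group in A.
# A cell holds the B label, or NaN when that group never produces the value.
wide = sums.reset_index().pivot(index='C', columns='A', values='B')
print(wide)

# dropna() keeps only the values that occur in every group of A
common = wide.dropna().index
print(common.tolist())
```

Because dropna removes any row containing a NaN, this approach only reports values present in all groups of A, which is why a different approach is needed once there are more than two groups.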
Then, if you want to filter the original data, use boolean indexing:
df = df[df['C'].isin(common)]
print(df)
            C
A   B
bar two    12
foo three  12
If you want the rows whose value is shared by at least 2 groups, the solution is:
print(df)
     A      B   C
0  foo    one   3
1  bar    one   4
2  foo    two   3
3  bar  three   8
4  foo    two  14
5  bar    two  12
6  foo    one  14
7  foo  three  12
8  xxx    yyy   8
df = df.groupby(['A', 'B']).sum()
print(df)
            C
A   B
bar one     4
    three   8  <- duplicate, group (bar, three)
    two    12  <- duplicate, group (bar, two)
foo one    17  <- 17 is duplicated only within group foo, so omitted
    three  12  <- duplicate, group (foo, three)
    two    17  <- 17 is duplicated only within group foo, so omitted
xxx yyy     8  <- duplicate, group (xxx, yyy)
common1 = (df.reset_index()
             .pivot_table(index='C', columns='A', values='B', aggfunc='size')
             .notna()
             .sum(axis=1)
          )
common1 = common1.index[common1.gt(1)]
print(common1)
Int64Index([8, 12], dtype='int64', name='C')
df1 = df[df['C'].isin(common1)]
print(df1)
            C
A   B
bar three   8
    two    12
foo three  12
xxx yyy     8
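The pivot_table step above counts, for each summed value of C, how many distinct groups of A contain it. As an alternative sketch (not the answer's method), the same count can be obtained with groupby and nunique; the names `sums` and `groups_per_value` are assumptions made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'xxx'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'yyy'],
                   'C': [3, 4, 3, 8, 14, 12, 14, 12, 8]})
sums = df.groupby(['A', 'B']).sum()

# Number of distinct A groups in which each summed value of C occurs
groups_per_value = sums.reset_index().groupby('C')['A'].nunique()

# Keep values seen in more than one A group.
# 17 occurs twice, but both times inside group foo, so it is not kept.
common1 = groups_per_value.index[groups_per_value.gt(1)]
result = sums[sums['C'].isin(common1)]
print(common1.tolist())
print(result)
```

Both variants should select the same rows here; the nunique form just skips the reshape into a wide table.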
To show a more instructive example, I added one row to the source DataFrame, so that it contains:
     A      B   C
0  foo    one   3
1  bar    one   4
2  foo    two   5
3  bar  three   8
4  foo    two  10
5  bar    two  12
6  foo    one  14
7  foo  three  12
8  xxx    yyy   8
I saved the grouping result in another DataFrame:
df2 = df.groupby(['A','B']).sum()
So it contains:
            C
A   B
bar one     4
    three   8
    two    12
foo one    17
    three  12
    two    15
xxx yyy     8
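As you can see, there are two duplicated values in C: 12 and 8. Note that the index of df2 is now unique.
Then, to show the duplicated values together with their groups, run: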
df2[df2.duplicated(keep=False)].sort_values('C')
which gives:
            C
A   B
bar three   8
xxx yyy     8
bar two    12
foo three  12
The result above shows all duplicated values and the groups (A and B) they belong to.
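One caveat worth keeping in mind: duplicated compares the values of C without looking at which group of A they come from, so a value repeated only inside a single group is reported as well. The sketch below uses the data from the earlier example in which (foo, one) and (foo, two) both sum to 17; the name `dupes` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'xxx'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'yyy'],
                   'C': [3, 4, 3, 8, 14, 12, 14, 12, 8]})
df2 = df.groupby(['A', 'B']).sum()

# keep=False marks every occurrence of a repeated value, not only the later ones
dupes = df2[df2.duplicated(keep=False)].sort_values('C')
print(dupes)

# 17 is included although both of its rows belong to group foo
print(sorted(set(dupes['C'].tolist())))
```

If values repeated within a single group of A should be excluded, a per-value count of distinct A groups, as in the first answer, is probably the safer choice.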