Finding duplicated values across groups in Pandas GroupBy
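The idea is to convert the MultiIndex into a 3-column DataFrame, reshape it with DataFrame.pivot, and remove the rows that are not duplicated across groups with DataFrame.dropna; the common values are then in the index: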
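import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': [3, 4, 5, 8, 10, 12, 14, 12]})
# Sum C per (A, B) group; the result is indexed by a MultiIndex
df = df.groupby(['A', 'B']).sum()
# Rows are C values, columns are A groups; pandas 2.0+ requires keyword arguments here
common = df.reset_index().pivot(index='C', columns='A', values='B').dropna().index
print(common)
Int64Index([12], dtype='int64', name='C')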
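To see why dropna leaves only the shared values, it can help to look at the intermediate pivoted frame. The following is a minimal sketch of that step, assuming a pandas version that accepts keyword arguments to pivot; the names grouped and wide are illustrative and do not come from the answer.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': [3, 4, 5, 8, 10, 12, 14, 12]})

# Sum C per (A, B) group, then flatten the MultiIndex into columns A, B, C
grouped = df.groupby(['A', 'B']).sum().reset_index()

# One row per C value, one column per A group; a cell holds B, or NaN
# when that A group has no (A, B) pair summing to this C value
wide = grouped.pivot(index='C', columns='A', values='B')
print(wide)

# dropna() keeps only rows without NaN, i.e. values present in every A group
common = wide.dropna().index
print(list(common))
```

Because dropna requires a non-missing cell in every column, this variant finds values shared by all A groups. Note also that pivot raises an error if the same C value occurs twice inside one A group.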
Then, to filter the original data, use boolean indexing:
df = df[df['C'].isin(common)]
print(df)
            C
A   B
bar two    12
foo three  12
If you want the values that are duplicated in at least 2 groups, the solution is:
print(df)
     A      B   C
0  foo    one   3
1  bar    one   4
2  foo    two   3
3  bar  three   8
4  foo    two  14
5  bar    two  12
6  foo    one  14
7  foo  three  12
8  xxx    yyy   8
df = df.groupby(['A', 'B']).sum()
print(df)
            C
A   B
bar one     4
    three   8   <- duplicate, from group (bar, three)
    two    12   <- duplicate, from group (bar, two)
foo one    17   <- 17 is duplicated only within group foo, so it is omitted
    three  12   <- duplicate, from group (foo, three)
    two    17   <- 17 is duplicated only within group foo, so it is omitted
xxx yyy     8   <- duplicate, from group (xxx, yyy)
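# Count (A, B) rows per C value and A group; absent combinations are NaN
common1 = (df.reset_index()
             .pivot_table(index='C', columns='A', values='B', aggfunc='size')
             .notna()        # True where the A group contains this C value
             .sum(axis=1)    # number of A groups containing this C value
          )
# Keep the C values that occur in more than one A group
common1 = common1.index[common1.gt(1)]
print(common1)
Int64Index([8, 12], dtype='int64', name='C')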
df1 = df[df['C'].isin(common1)]
print(df1)
            C
A   B
bar three   8
    two    12
foo three  12
xxx yyy     8
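As a cross-check, the same "at least 2 groups" rule can be sketched without pivot_table by counting the distinct A groups in which each summed C value occurs. This is an alternative formulation, not the method used in the answer; the names grouped, groups_per_value, common_alt and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'xxx'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'yyy'],
                   'C': [3, 4, 3, 8, 14, 12, 14, 12, 8]})

grouped = df.groupby(['A', 'B']).sum()

# For every summed C value, count how many distinct A groups contain it
groups_per_value = grouped.reset_index().groupby('C')['A'].nunique()

# Keep the values seen in at least 2 different A groups
common_alt = groups_per_value.index[groups_per_value.gt(1)]
print(list(common_alt))

# Filter the grouped frame down to those values
result = grouped[grouped['C'].isin(common_alt)]
print(result)
```

The value 17 is excluded here as well, because both of its occurrences belong to the single group foo.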
To show a more instructive example, I added one row to the source DataFrame, so that it contains:
     A      B   C
0  foo    one   3
1  bar    one   4
2  foo    two   5
3  bar  three   8
4  foo    two  10
5  bar    two  12
6  foo    one  14
7  foo  three  12
8  xxx    yyy   8
I saved the grouping result in another DataFrame:
df2 = df.groupby(['A','B']).sum()
So it contains:
            C
A   B
bar one     4
    three   8
    two    12
foo one    17
    three  12
    two    15
xxx yyy     8
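As you can see, C contains two duplicated values: 12 and 8. Note that the index of df2 is now unique.
Then, to show the duplicated values together with their groups, run: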
df2[df2.duplicated(keep=False)].sort_values('C')
The result is:
            C
A   B
bar three   8
xxx yyy     8
bar two    12
foo three  12
The result above shows all duplicated values and the groups (A and B) they belong to.
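One point worth keeping in mind is that duplicated(keep=False) flags every repeated value, regardless of which A group it comes from. The sketch below uses a variant of the data in which the sum 17 occurs twice, both times inside group foo; the variable name dupes is illustrative.

```python
import pandas as pd

# Variant data: (foo, one) and (foo, two) both sum to 17
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'xxx'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'yyy'],
                   'C': [3, 4, 3, 8, 14, 12, 14, 12, 8]})
df2 = df.groupby(['A', 'B']).sum()

# keep=False marks all occurrences of a repeated value, not only the later ones
dupes = df2[df2.duplicated(keep=False)].sort_values('C')
print(dupes)

# The rows with C == 17 all come from the same A group
print(set(dupes[dupes['C'] == 17].index.get_level_values('A')))
```

With this data the two 17 rows are reported too, so this approach fits when a repeat inside a single A group should also count as a duplicate.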