Finding duplicated values in a Pandas GroupBy

Question

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': [3,4,5,8,10,12,14,12]})
df.groupby(['A','B']).sum()

How can I find out whether a value in column C is also repeated in another group? (Here, 12 is repeated in two groups.)

Answer 1

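The idea is to convert the MultiIndex into a 3-column DataFrame with reset_index, reshape it with DataFrame.pivot, and then drop the rows for non-repeated values with DataFrame.dropna. The common values are left in the index: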

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': [3,4,5,8,10,12,14,12]})
df = df.groupby(['A','B']).sum()

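# pivot takes keyword arguments; positional arguments fail in pandas 2.0 and later
common = df.reset_index().pivot(index='C', columns='A', values='B').dropna().index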
print (common)
Int64Index([12], dtype='int64', name='C')

Then, if you want to filter the grouped data, use boolean indexing:

df = df[df['C'].isin(common)]
print (df)
            C
A   B        
bar two    12
foo three  12
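
For reference, the steps above can be put together as one self-contained sketch. The names grouped, wide and filtered are introduced here only for illustration and are not part of the original answer. Note that dropna keeps a sum only if it occurs under every value of A, which is why this variant fits data with exactly two values in A.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': [3, 4, 5, 8, 10, 12, 14, 12]})

# Sum C per (A, B) group; the result has a MultiIndex of (A, B).
grouped = df.groupby(['A', 'B']).sum()

# One row per sum, one column per value of A; cells hold the B label or NaN.
wide = grouped.reset_index().pivot(index='C', columns='A', values='B')

# Rows without NaN are sums that appear under every value of A.
common = wide.dropna().index

# Keep only the groups whose sum is one of the common values.
filtered = grouped[grouped['C'].isin(common)]
print(filtered)
```

With three or more values in A, a sum shared by only two of them would be dropped by dropna, so the pivot_table variant later in this answer is the safer choice in that case.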

If you want the rows whose value is repeated in at least 2 different groups of A, the solution is:

print (df)  
     A      B   C
0  foo    one   3
1  bar    one   4
2  foo    two   3
3  bar  three   8
4  foo    two  14
5  bar    two  12
6  foo    one  14
7  foo  three  12
8  xxx    yyy   8

df = df.groupby(['A','B']).sum()
print (df)
            C
A   B        
bar one     4
    three   8 <- duplicate, from group bar, three
    two    12 <- duplicate, from group bar, two
foo one    17 <- 17 repeats only within group foo, so omitted
    three  12 <- duplicate, from group foo, three
    two    17 <- 17 repeats only within group foo, so omitted
xxx yyy     8 <- duplicate, from group xxx, yyy

common1 = (df.reset_index()
             .pivot_table(index='C',columns='A', values='B', aggfunc='size')
             .notna()
             .sum(axis=1)
            )
common1 = common1.index[common1.gt(1)]
print (common1)
Int64Index([8, 12], dtype='int64', name='C')

df1 = df[df['C'].isin(common1)]
print (df1)
            C
A   B        
bar three   8
    two    12
foo three  12
xxx yyy     8
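
The same selection can probably also be written without reshaping. The sketch below is an alternative, not the answer's method: it counts the distinct A values that share each sum with groupby and transform('nunique'). The names sums, n_groups and shared are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo',
                         'bar', 'foo', 'foo', 'xxx'],
                   'B': ['one', 'one', 'two', 'three', 'two',
                         'two', 'one', 'three', 'yyy'],
                   'C': [3, 4, 3, 8, 14, 12, 14, 12, 8]})

# Sum C per (A, B) group, keeping A and B as ordinary columns.
sums = df.groupby(['A', 'B'], as_index=False)['C'].sum()

# For every row, count how many distinct A values have the same sum.
n_groups = sums.groupby('C')['A'].transform('nunique')

# Keep sums shared by more than one A; 17 occurs twice but only under foo.
shared = sums[n_groups > 1].set_index(['A', 'B'])
print(shared)
```

Because the count is taken over distinct A values, a sum that repeats only inside one A group, such as 17 here, is left out, which matches the behaviour of the pivot_table solution.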

Answer 2

To show a more instructive example, I added one row to the source DataFrame, so that it contains:

     A      B   C
0  foo    one   3
1  bar    one   4
2  foo    two   5
3  bar  three   8
4  foo    two  10
5  bar    two  12
6  foo    one  14
7  foo  three  12
8  xxx    yyy   8

I saved the grouped result in another DataFrame:

df2 = df.groupby(['A','B']).sum()

So it contains:

            C
A   B        
bar one     4
    three   8
    two    12
foo one    17
    three  12
    two    15
xxx yyy     8

As you can see, there are two repeated values in C: 12 and 8. Note that the index of df2 is now unique.

Then, to show the repeated values together with their groups, run:

df2[df2.duplicated(keep=False)].sort_values('C')

The result is:

            C
A   B        
bar three   8
xxx yyy     8
bar two    12
foo three  12

The result above shows all repeated values and the groups (A and B) they belong to.
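
As a runnable recap of this answer, here is a minimal sketch that rebuilds the 9-row DataFrame shown above and applies the same duplicated call. The variable name dupes is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo',
                         'bar', 'foo', 'foo', 'xxx'],
                   'B': ['one', 'one', 'two', 'three', 'two',
                         'two', 'one', 'three', 'yyy'],
                   'C': [3, 4, 5, 8, 10, 12, 14, 12, 8]})

# Sum C per (A, B) group.
df2 = df.groupby(['A', 'B']).sum()

# keep=False marks every occurrence of a repeated value, not only the later ones.
dupes = df2[df2.duplicated(keep=False)].sort_values('C')
print(dupes)
```

One difference from Answer 1 is worth keeping in mind: duplicated compares the values of C only, so a sum that repeats within a single A group would also be reported here.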
