pandas: how to check that a certain value in a column occurs at most once in each group (after groupby)

I have a pandas DataFrame that I want to group by column A, and I want to check that a certain value ('test') in column B does not occur more than once in each group.

Is there a pandas-native way to do the following:
1 - find the groups where 'test' appears in column B more than once?
2 - delete the additional occurrences (keep the one with the min value in column C).

Example:

    A   B       C
0   1   test    342
1   1   t       4556
2   1   te      222
3   1   test    56456
4   2   t       234525
5   2   te      123
6   2   test    23434
7   3   test    777
8   3   tes     665

If I group by 'A', 'test' appears twice in the group A==1, which is the case I would like to deal with.

Solution for removing duplicated 'test' values per combination of columns A and B, keeping the first occurrence in each group:

# keep every non-'test' row, plus the first 'test' row of each (A, B) pair
df = df[df.B.ne('test') | ~df.duplicated(['A','B'])]
print(df)
   A     B       C
0  1  test     342
1  1     t    4556
2  1    te     222
4  2     t  234525
5  2    te     123
6  2  test   23434
7  3  test     777
8  3   tes     665
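
The first part of the question (finding the groups where 'test' occurs more than once) can be checked before anything is removed. The following is a minimal sketch, assuming the sample data from the question; the names `test_counts` and `groups_with_repeats` are illustrative only.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "A": [1, 1, 1, 1, 2, 2, 2, 3, 3],
    "B": ["test", "t", "te", "test", "t", "te", "test", "test", "tes"],
    "C": [342, 4556, 222, 56456, 234525, 123, 23434, 777, 665],
})

# Count how many times 'test' occurs in B within each group of A
test_counts = df["B"].eq("test").groupby(df["A"]).sum()

# Groups where 'test' occurs more than once
groups_with_repeats = test_counts[test_counts > 1].index.tolist()
print(groups_with_repeats)  # [1]
```

Summing a boolean Series counts its True values, so no intermediate filtering of the DataFrame is needed.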

EDIT: If you need to keep the 'test' row with the minimal C value in each group, and want to keep every row that ties for that minimum, replace C with NaN for the non-'test' rows using Series.mask, compute the per-group minimum with GroupBy.transform, and compare it with C:

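# True for rows where B is not 'test'
m = df.B.ne('test')
# keep non-'test' rows, plus 'test' rows whose C equals the group's minimal 'test' C
df = df[m | ~df.C.mask(m).groupby(df['A']).transform('min').ne(df['C'])]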
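
To make the one-liner easier to follow, here is a sketch that splits it into its intermediate steps, assuming the sample data from the question. The names `c_test`, `group_min`, `keep` and `result` are illustrative, and the comparison is written with `eq` instead of the negated `ne`, which selects the same rows.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "A": [1, 1, 1, 1, 2, 2, 2, 3, 3],
    "B": ["test", "t", "te", "test", "t", "te", "test", "test", "tes"],
    "C": [342, 4556, 222, 56456, 234525, 123, 23434, 777, 665],
})

m = df["B"].ne("test")      # True for rows that are not 'test'
c_test = df["C"].mask(m)    # C with the non-'test' rows replaced by NaN

# Minimal C among the 'test' rows of each group, broadcast to every row of the group
group_min = c_test.groupby(df["A"]).transform("min")

# Keep non-'test' rows, plus 'test' rows whose C equals the group minimum
keep = m | df["C"].eq(group_min)
result = df[keep]
print(result.index.tolist())  # [0, 1, 2, 4, 5, 6, 7, 8]
```

Row 3 (the second 'test' in group 1, with C = 56456) is the only row dropped.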

If you need only the first of the tied minimal 'test' rows, use SeriesGroupBy.idxmin on the filtered DataFrame:

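# True for rows where B is not 'test'
m = df.B.ne('test')
# index labels of the 'test' row with the minimal C in each group (first one on ties)
m1 = df.index.isin(df[~m].groupby('A')['C'].idxmin())

df = df[m | m1]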
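
Keeping the first occurrence and keeping the minimal C give the same result on the question's data only because the first 'test' row in group 1 also has the smallest C. The sketch below uses hypothetical data (not from the question) in which the smaller value comes second, to show where the two approaches diverge; `first_kept`, `min_idx` and `min_kept` are illustrative names.

```python
import pandas as pd

# Hypothetical data: in group 1 the 'test' row with the smaller C comes second
df = pd.DataFrame({
    "A": [1, 1, 1, 2],
    "B": ["test", "t", "test", "test"],
    "C": [500, 10, 300, 7],
})

m = df["B"].ne("test")  # True for rows that are not 'test'

# Approach 1: keep the first 'test' row per group, regardless of C
first_kept = df[m | ~df.duplicated(["A", "B"])]

# Approach 2: keep the 'test' row with the minimal C per group
min_idx = df[~m].groupby("A")["C"].idxmin()
min_kept = df[m | df.index.isin(min_idx)]

print(first_kept.index.tolist())  # [0, 1, 3]
print(min_kept.index.tolist())    # [1, 2, 3]
```

Note that matching on index labels with `isin` assumes the index is unique.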

Difference between the solutions, using sample data in which three 'test' rows in group 1 share the minimal C value:

print(df)
    A     B       C
-2  1  test     342
-1  1  test     342
 0  1  test     342
 1  1     t    4556
 2  1    te     222
 3  1  test   56456
 4  2     t  234525
 5  2    te     123
 6  2  test   23434
 7  3  test     777
 8  3   tes     665

# transform('min'): every 'test' row tied for the group minimum is kept
m = df.B.ne('test')
df1 = df[m | ~df.C.mask(m).groupby(df['A']).transform('min').ne(df['C'])]
print(df1)
    A     B       C
-2  1  test     342
-1  1  test     342
 0  1  test     342
 1  1     t    4556
 2  1    te     222
 4  2     t  234525
 5  2    te     123
 6  2  test   23434
 7  3  test     777
 8  3   tes     665

# idxmin: only the first 'test' row with the group minimum is kept
m = df.B.ne('test')
m1 = df.index.isin(df[~m].groupby('A')['C'].idxmin())

df2 = df[m | m1]
print(df2)
    A     B       C
-2  1  test     342
 1  1     t    4556
 2  1    te     222
 4  2     t  234525
 5  2    te     123
 6  2  test   23434
 7  3  test     777
 8  3   tes     665
