pandas: how to check that a certain value in a column appears at most once in each group (after groupby)
I have a pandas DataFrame that I want to group by column A, and I want to check that a certain value ('test') in column B does not appear more than once in each group.
Is there a pandas-native way to do the following:
1 - find the groups where 'test' appears in column B more than once?
2 - delete the additional occurrences (keep the one with the min value in column C).
Example:
   A     B       C
0  1  test     342
1  1     t    4556
2  1    te     222
3  1  test   56456
4  2     t  234525
5  2    te     123
6  2  test   23434
7  3  test     777
8  3   tes     665
If I group by 'A', 'test' appears twice in the group A == 1, which is the case I would like to deal with.
Solution to remove duplicated 'test' values by columns A and B, keeping the first value per group:
df = df[df.B.ne('test') | ~df.duplicated(['A','B'])]
print (df)
   A     B       C
0  1  test     342
1  1     t    4556
2  1    te     222
4  2     t  234525
5  2    te     123
6  2  test   23434
7  3  test     777
8  3   tes     665
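The solution above keeps the first 'test' row per group in the existing row order, which is the minimal C only if that row happens to come first. A minimal sketch of one way to make "first" mean "smallest C", assuming the question's sample data: sort by C before building the mask, then restore the original order. The names by_c, keep and result are illustrative only.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": [1, 1, 1, 1, 2, 2, 2, 3, 3],
    "B": ["test", "t", "te", "test", "t", "te", "test", "test", "tes"],
    "C": [342, 4556, 222, 56456, 234525, 123, 23434, 777, 665],
})

# Sort by C (stable sort) so the first 'test' row per group has the smallest C.
by_c = df.sort_values("C", kind="mergesort")

# Keep every non-'test' row, plus the first 'test' row per (A, B) pair.
keep = by_c["B"].ne("test") | ~by_c.duplicated(["A", "B"])

# Filter, then restore the original row order.
result = by_c[keep].sort_index()
print(result)
```

With this sample data the result matches the output above, because the first 'test' row in group A == 1 already has the smallest C.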
EDIT: If you need the 'test' row in B with the minimal C per group, and want to keep all rows that tie for that minimum, replace C with NaN for the non-'test' rows using Series.mask, then compare C with the per-group minimum from GroupBy.transform:
m = df.B.ne('test')
df = df[m | ~df.C.mask(m).groupby(df['A']).transform('min').ne(df['C'])]
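To make the one-liner above easier to follow, here is a sketch that splits it into named intermediate steps, assuming the question's sample data. The variable names are illustrative, and `.eq(...)` is used as an equivalent restatement of `~....ne(...)`.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": [1, 1, 1, 1, 2, 2, 2, 3, 3],
    "B": ["test", "t", "te", "test", "t", "te", "test", "test", "tes"],
    "C": [342, 4556, 222, 56456, 234525, 123, 23434, 777, 665],
})

# True for rows whose B is not 'test'; these rows are always kept.
is_other = df["B"].ne("test")

# C for the 'test' rows, NaN everywhere else.
test_c = df["C"].mask(is_other)

# Smallest 'test' C per group of A, broadcast back to every row of the group.
group_min = test_c.groupby(df["A"]).transform("min")

# Keep non-'test' rows, plus 'test' rows whose C equals the group minimum.
keep = is_other | group_min.eq(df["C"])
result = df[keep]
print(result)
```

Because the comparison is by value, every 'test' row tied for the minimum is kept.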
But if you need only the first 'test' row with the minimal C per group, use SeriesGroupBy.idxmin on the filtered DataFrame:
m = df.B.ne('test')
m1 = df.index.isin(df[~m].groupby('A')['C'].idxmin())
df = df[m | m1]
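A sketch of what idxmin returns here, assuming the question's sample data: one index label per group, pointing at the 'test' row with the smallest C (the first such row when several tie). Because rows are selected by label, this approach assumes the index labels are unique. The variable names are illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": [1, 1, 1, 1, 2, 2, 2, 3, 3],
    "B": ["test", "t", "te", "test", "t", "te", "test", "test", "tes"],
    "C": [342, 4556, 222, 56456, 234525, 123, 23434, 777, 665],
})

is_test = df["B"].eq("test")

# One index label per group of A: the 'test' row with the smallest C.
min_labels = df[is_test].groupby("A")["C"].idxmin()
print(min_labels)

# Keep non-'test' rows, plus the single selected 'test' row per group.
keep = ~is_test | df.index.isin(min_labels)
result = df[keep]
print(result)
```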
Difference between the solutions:
print (df)
    A     B       C
-2  1  test     342
-1  1  test     342
 0  1  test     342
 1  1     t    4556
 2  1    te     222
 3  1  test   56456
 4  2     t  234525
 5  2    te     123
 6  2  test   23434
 7  3  test     777
 8  3   tes     665
m = df.B.ne('test')
df1 = df[m | ~df.C.mask(m).groupby(df['A']).transform('min').ne(df['C'])]
print (df1)
    A     B       C
-2  1  test     342
-1  1  test     342
 0  1  test     342
 1  1     t    4556
 2  1    te     222
 4  2     t  234525
 5  2    te     123
 6  2  test   23434
 7  3  test     777
 8  3   tes     665
m = df.B.ne('test')
m1 = df.index.isin(df[~m].groupby('A')['C'].idxmin())
df2 = df[m | m1]
print (df2)
    A     B       C
-2  1  test     342
 1  1     t    4556
 2  1    te     222
 4  2     t  234525
 5  2    te     123
 6  2  test   23434
 7  3  test     777
 8  3   tes     665
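The solutions above handle the second part of the question (removing the extra occurrences). For the first part, finding the groups where 'test' appears more than once, a minimal sketch under the same sample data could count the 'test' rows per group. The names counts and groups are illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": [1, 1, 1, 1, 2, 2, 2, 3, 3],
    "B": ["test", "t", "te", "test", "t", "te", "test", "test", "tes"],
    "C": [342, 4556, 222, 56456, 234525, 123, 23434, 777, 665],
})

# Number of 'test' rows in each group of A (True counts as 1 when summed).
counts = df["B"].eq("test").groupby(df["A"]).sum()

# Groups where 'test' appears more than once.
groups = counts[counts > 1].index.tolist()
print(groups)
```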