Remove duplicates from a Pandas dataframe based on other columns

I am working on a grouped Pandas dataset that looks like this:

test_identifier timestamp   Count_of_Fail_tests test_status
test1   22-08-2022 07:00    0   pass
        23-08-2022 07:00    0   pass
        24-08-2022 07:00    0   pass
        25-08-2022 07:00    0   pass
        26-08-2022 07:00    0   pass
test10  22-08-2022 07:00    0   pass
        23-08-2022 07:00    0   pass
        24-08-2022 07:00    0   pass
        25-08-2022 07:00    0   pass
        26-08-2022 07:00    0   pass
test2   22-08-2022 07:00    2   fail
        23-08-2022 07:00    2   pass
        24-08-2022 07:00    2   pass
        25-08-2022 07:00    2   fail
        26-08-2022 07:00    2   pass
test3   22-08-2022 07:00    0   pass
        23-08-2022 07:00    0   pass
        24-08-2022 07:00    0   pass
        25-08-2022 07:00    0   pass
        26-08-2022 07:00    0   pass
test4   22-08-2022 07:00    1   pass
        23-08-2022 07:00    1   pass
        24-08-2022 07:00    1   pass
        25-08-2022 07:00    1   pass
        26-08-2022 07:00    1   fail

I want to sort or remove duplicates in the Count_of_Fail_tests column so that the output looks like this:

test_identifier timestamp   Count_of_Fail_tests test_status
test1   22-08-2022 07:00    0   pass
        23-08-2022 07:00        pass
        24-08-2022 07:00        pass
        25-08-2022 07:00        pass
        26-08-2022 07:00        pass
test10  22-08-2022 07:00        pass
        23-08-2022 07:00        pass
        24-08-2022 07:00        pass
        25-08-2022 07:00        pass
        26-08-2022 07:00        pass
test2   22-08-2022 07:00    2   fail
        23-08-2022 07:00        pass
        24-08-2022 07:00        pass
        25-08-2022 07:00        fail
        26-08-2022 07:00        pass
test3   22-08-2022 07:00    0   pass
        23-08-2022 07:00        pass
        24-08-2022 07:00        pass
        25-08-2022 07:00        pass
        26-08-2022 07:00        pass
test4   22-08-2022 07:00        pass
        23-08-2022 07:00        pass
        24-08-2022 07:00        pass
        25-08-2022 07:00        pass
        26-08-2022 07:00    1   fail

It is also fine if the test_status column is removed. Basically, the goal is to find the number of failed runs for each test case per day.

Original dataframe:

    timestamp   test_identifier test_status
0   22-08-2022 07:00    test1   pass
1   22-08-2022 07:00    test2   fail
2   22-08-2022 07:00    test3   pass
3   22-08-2022 07:00    test4   pass
4   22-08-2022 07:00    test5   pass
5   22-08-2022 07:00    test6   pass
6   22-08-2022 07:00    test7   fail
7   22-08-2022 07:00    test8   pass
8   22-08-2022 07:00    test9   pass
9   22-08-2022 07:00    test10  pass
10  23-08-2022 07:00    test1   pass
11  23-08-2022 07:00    test2   pass
12  23-08-2022 07:00    test3   pass
13  23-08-2022 07:00    test4   pass
14  23-08-2022 07:00    test5   fail
15  23-08-2022 07:00    test6   pass
16  23-08-2022 07:00    test7   pass
17  23-08-2022 07:00    test8   pass
18  23-08-2022 07:00    test9   pass
19  23-08-2022 07:00    test10  pass
20  24-08-2022 07:00    test1   pass
21  24-08-2022 07:00    test2   pass
22  24-08-2022 07:00    test3   pass
23  24-08-2022 07:00    test4   pass
24  24-08-2022 07:00    test5   pass
25  24-08-2022 07:00    test6   pass
26  24-08-2022 07:00    test7   pass
27  24-08-2022 07:00    test8   pass
28  24-08-2022 07:00    test9   pass
29  24-08-2022 07:00    test10  pass
30  25-08-2022 07:00    test1   pass
31  25-08-2022 07:00    test2   fail
32  25-08-2022 07:00    test3   pass
33  25-08-2022 07:00    test4   pass
34  25-08-2022 07:00    test5   pass
35  25-08-2022 07:00    test6   pass
36  25-08-2022 07:00    test7   fail
37  25-08-2022 07:00    test8   pass
38  25-08-2022 07:00    test9   pass
39  25-08-2022 07:00    test10  pass
40  26-08-2022 07:00    test1   pass
41  26-08-2022 07:00    test2   pass
42  26-08-2022 07:00    test3   pass
43  26-08-2022 07:00    test4   fail
44  26-08-2022 07:00    test5   pass
45  26-08-2022 07:00    test6   pass
46  26-08-2022 07:00    test7   pass
47  26-08-2022 07:00    test8   pass
48  26-08-2022 07:00    test9   pass
49  26-08-2022 07:00    test10  pass

This should give you the desired results. It hides all the "pass" rows in your df and counts only the fails per date/test pairing.

# Blank out the "pass" rows, then count the remaining (failed) rows per timestamp/test pair
df.mask(df['test_status'].eq('pass')).groupby(['timestamp', 'test_identifier'], as_index=False)['test_status'].count()
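
To make the one-liner easier to try out, here is a minimal, self-contained sketch. The sample data is a hypothetical six-row subset in the shape of the original dataframe, and the names `fails_only` and `fail_counts` are illustrative only. The second part is a variation, not part of the answer above: it assumes you may also want the pairs with zero failures to stay in the result.

```python
import pandas as pd

# Hypothetical sample data: a small subset in the shape of the original dataframe.
df = pd.DataFrame(
    {
        "timestamp": ["22-08-2022 07:00"] * 3 + ["23-08-2022 07:00"] * 3,
        "test_identifier": ["test1", "test2", "test7"] * 2,
        "test_status": ["pass", "fail", "fail", "pass", "pass", "pass"],
    }
)

# Approach from the answer: rows whose status is "pass" become NaN in every
# column, and groupby drops NaN keys by default, so only failing pairs remain.
fails_only = (
    df.mask(df["test_status"].eq("pass"))
    .groupby(["timestamp", "test_identifier"], as_index=False)["test_status"]
    .count()
)
print(fails_only)

# Variation (an assumption about what may be wanted): keep every
# timestamp/test pair and report 0 where nothing failed, by summing a
# boolean "is fail" column instead of masking.
fail_counts = (
    df.assign(failed=df["test_status"].eq("fail"))
    .groupby(["timestamp", "test_identifier"], as_index=False)["failed"]
    .sum()
    .rename(columns={"failed": "Count_of_Fail_tests"})
)
print(fail_counts)
```

With the masking approach, a test that never fails on a given day does not appear in the result at all. The boolean-sum variation keeps those rows with a count of 0, which may be closer to the grouped table shown in the question.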
