Remove duplicates from a Pandas DataFrame based on other columns
I am working on a grouped Pandas dataset that looks like this:
test_identifier  timestamp         Count_of_Fail_tests  test_status
test1            22-08-2022 07:00  0                    pass
                 23-08-2022 07:00  0                    pass
                 24-08-2022 07:00  0                    pass
                 25-08-2022 07:00  0                    pass
                 26-08-2022 07:00  0                    pass
test10           22-08-2022 07:00  0                    pass
                 23-08-2022 07:00  0                    pass
                 24-08-2022 07:00  0                    pass
                 25-08-2022 07:00  0                    pass
                 26-08-2022 07:00  0                    pass
test2            22-08-2022 07:00  2                    fail
                 23-08-2022 07:00  2                    pass
                 24-08-2022 07:00  2                    pass
                 25-08-2022 07:00  2                    fail
                 26-08-2022 07:00  2                    pass
test3            22-08-2022 07:00  0                    pass
                 23-08-2022 07:00  0                    pass
                 24-08-2022 07:00  0                    pass
                 25-08-2022 07:00  0                    pass
                 26-08-2022 07:00  0                    pass
test4            22-08-2022 07:00  1                    pass
                 23-08-2022 07:00  1                    pass
                 24-08-2022 07:00  1                    pass
                 25-08-2022 07:00  1                    pass
                 26-08-2022 07:00  1                    fail
I want to sort or remove duplicates in the Count_of_Fail_tests column so that the output looks like this:
test_identifier  timestamp         Count_of_Fail_tests  test_status
test1            22-08-2022 07:00  0                    pass
                 23-08-2022 07:00                       pass
                 24-08-2022 07:00                       pass
                 25-08-2022 07:00                       pass
                 26-08-2022 07:00                       pass
test10           22-08-2022 07:00                       pass
                 23-08-2022 07:00                       pass
                 24-08-2022 07:00                       pass
                 25-08-2022 07:00                       pass
                 26-08-2022 07:00                       pass
test2            22-08-2022 07:00  2                    fail
                 23-08-2022 07:00                       pass
                 24-08-2022 07:00                       pass
                 25-08-2022 07:00                       fail
                 26-08-2022 07:00                       pass
test3            22-08-2022 07:00  0                    pass
                 23-08-2022 07:00                       pass
                 24-08-2022 07:00                       pass
                 25-08-2022 07:00                       pass
                 26-08-2022 07:00                       pass
test4            22-08-2022 07:00                       pass
                 23-08-2022 07:00                       pass
                 24-08-2022 07:00                       pass
                 25-08-2022 07:00                       pass
                 26-08-2022 07:00  1                    fail
If the test_status column is removed, that is also OK. Basically, the goal is to find the number of failed runs for each test case per day.
Original DataFrame:
timestamp test_identifier test_status
0 22-08-2022 07:00 test1 pass
1 22-08-2022 07:00 test2 fail
2 22-08-2022 07:00 test3 pass
3 22-08-2022 07:00 test4 pass
4 22-08-2022 07:00 test5 pass
5 22-08-2022 07:00 test6 pass
6 22-08-2022 07:00 test7 fail
7 22-08-2022 07:00 test8 pass
8 22-08-2022 07:00 test9 pass
9 22-08-2022 07:00 test10 pass
10 23-08-2022 07:00 test1 pass
11 23-08-2022 07:00 test2 pass
12 23-08-2022 07:00 test3 pass
13 23-08-2022 07:00 test4 pass
14 23-08-2022 07:00 test5 fail
15 23-08-2022 07:00 test6 pass
16 23-08-2022 07:00 test7 pass
17 23-08-2022 07:00 test8 pass
18 23-08-2022 07:00 test9 pass
19 23-08-2022 07:00 test10 pass
20 24-08-2022 07:00 test1 pass
21 24-08-2022 07:00 test2 pass
22 24-08-2022 07:00 test3 pass
23 24-08-2022 07:00 test4 pass
24 24-08-2022 07:00 test5 pass
25 24-08-2022 07:00 test6 pass
26 24-08-2022 07:00 test7 pass
27 24-08-2022 07:00 test8 pass
28 24-08-2022 07:00 test9 pass
29 24-08-2022 07:00 test10 pass
30 25-08-2022 07:00 test1 pass
31 25-08-2022 07:00 test2 fail
32 25-08-2022 07:00 test3 pass
33 25-08-2022 07:00 test4 pass
34 25-08-2022 07:00 test5 pass
35 25-08-2022 07:00 test6 pass
36 25-08-2022 07:00 test7 fail
37 25-08-2022 07:00 test8 pass
38 25-08-2022 07:00 test9 pass
39 25-08-2022 07:00 test10 pass
40 26-08-2022 07:00 test1 pass
41 26-08-2022 07:00 test2 pass
42 26-08-2022 07:00 test3 pass
43 26-08-2022 07:00 test4 fail
44 26-08-2022 07:00 test5 pass
45 26-08-2022 07:00 test6 pass
46 26-08-2022 07:00 test7 pass
47 26-08-2022 07:00 test8 pass
48 26-08-2022 07:00 test9 pass
49 26-08-2022 07:00 test10 pass
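For anyone who wants to experiment with this data, the frame above could be rebuilt roughly as follows. This is only a sketch: the helper names dates, tests and fails are made up for illustration, and timestamps are kept as plain strings exactly as printed in the table.

```python
import pandas as pd

# Dates and test names as they appear in the table above.
dates = [f"{day}-08-2022 07:00" for day in range(22, 27)]
tests = [f"test{i}" for i in range(1, 11)]

# (timestamp, test_identifier) pairs marked "fail" in the table above.
fails = {
    ("22-08-2022 07:00", "test2"),
    ("22-08-2022 07:00", "test7"),
    ("23-08-2022 07:00", "test5"),
    ("25-08-2022 07:00", "test2"),
    ("25-08-2022 07:00", "test7"),
    ("26-08-2022 07:00", "test4"),
}

# One row per (date, test) pair, in the same order as the table.
df = pd.DataFrame(
    [
        {
            "timestamp": ts,
            "test_identifier": test,
            "test_status": "fail" if (ts, test) in fails else "pass",
        }
        for ts in dates
        for test in tests
    ]
)
```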
This should give you the desired results. It hides all the "pass" rows in your df and counts only the fails for each date/test pairing. Masked rows become NaN, and groupby drops NaN keys by default, so pairings without a failure do not appear in the result.
df.mask(df['test_status'].eq('pass')).groupby(['timestamp', 'test_identifier'], as_index=False)['test_status'].count()
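If the passing pairings should stay in the result with a count of 0, one possible variant is to sum a boolean "is fail" flag instead of masking. The sketch below is not part of the original answer; the sample data is a small illustrative subset, and the names is_fail, fails_per_day, fails_per_test and fail_count are hypothetical.

```python
import pandas as pd

# Small illustrative sample (not the full data from the question).
df = pd.DataFrame(
    {
        "timestamp": ["22-08-2022 07:00"] * 3 + ["23-08-2022 07:00"] * 3,
        "test_identifier": ["test1", "test2", "test4"] * 2,
        "test_status": ["pass", "fail", "pass", "pass", "fail", "fail"],
    }
)

# Flag failures as booleans; summing booleans counts the True values.
is_fail = df["test_status"].eq("fail")

# Failures per timestamp and test; passing pairs are kept with a count of 0.
fails_per_day = (
    is_fail.groupby([df["timestamp"], df["test_identifier"]])
    .sum()
    .reset_index(name="fail_count")
)

# Total failures per test across all timestamps.
fails_per_test = is_fail.groupby(df["test_identifier"]).sum()
```

Here fails_per_day has one row per timestamp/test pairing, while fails_per_test corresponds to the single Count_of_Fail_tests value per test shown in the question.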