简体   繁体   中英

pandas dataframe count duplicates by group and date

I have a dataframe with 3 columns:

  • 'Date' unique for each row
  • 'Group' group to which each row belongs
  • 'ID' identifier for each row

In the 'ID' columns there are some duplicate values. I want to count the number of repetitions for each group and for each date.
Example of dataframe:

                   Date Group    ID
0   2021-10-29 09:15:52     B  9352
1   2021-10-29 10:58:57     A  9352
2   2021-10-29 11:20:46     C  9352
3   2021-10-29 12:47:34     C  6274
4   2021-10-29 15:41:35     C  1677
5   2021-10-29 16:12:39     B  1677
6   2021-10-29 18:57:56     B  9225
7   2021-10-29 19:46:46     C  9225
8   2021-10-30 01:23:07     C  9225
9   2021-10-30 02:13:57     A  9225
10  2021-10-30 05:03:52     B  9329
11  2021-10-30 07:48:39     B  9329
12  2021-10-30 08:45:00     A  9329
13  2021-10-30 11:17:47     C  9329
14  2021-10-30 21:46:07     C  9496
15  2021-10-30 22:13:29     A  1218
16  2021-10-31 05:39:38     C  2422
17  2021-10-31 05:39:41     C  9654
18  2021-10-31 10:10:21     A  1951
19  2021-10-31 10:19:45     A  1951
20  2021-10-31 16:40:10     A  1951
21  2021-10-31 16:41:23     C  1951
22  2021-10-31 22:07:16     A  1951
23  2021-11-01 00:26:30     C  2867
24  2021-11-01 01:25:46     B  2867
25  2021-11-01 01:53:16     B  4262
26  2021-11-01 01:58:30     A  4581
27  2021-11-01 05:23:26     C  1734
28  2021-11-01 05:38:22     C  1734
29  2021-11-01 05:47:43     C  1734
30  2021-11-01 06:49:27     A  4813
31  2021-11-01 07:54:02     C  4813
32  2021-11-01 12:10:48     C  8661
33  2021-11-01 14:32:50     C  4138
34  2021-11-01 18:38:16     B  4138
35  2021-11-01 21:33:37     C  4138
36  2021-11-02 02:34:53     B  4138
37  2021-11-02 04:45:56     C  4138
38  2021-11-02 07:33:38     C  4138
39  2021-11-02 07:40:33     C  4138
40  2021-11-02 08:06:21     B  4138
41  2021-11-02 08:32:20     C  4138
42  2021-11-02 09:47:26     A  4138
43  2021-11-02 15:51:33     C  4138
44  2021-11-02 16:04:33     B  2433
45  2021-11-02 20:47:01     B  2433
46  2021-11-03 04:11:57     A  4594
47  2021-11-03 12:36:16     A  6829
48  2021-11-03 14:42:14     A  6829
49  2021-11-03 18:03:27     B  7138
50  2021-11-03 18:46:46     C  7138
51  2021-11-03 19:07:01     C  7138
52  2021-11-03 19:23:02     A  9752
53  2021-11-03 21:10:51     A  2699
54  2021-11-04 00:58:12     C  2699
55  2021-11-04 03:44:12     A  7463
56  2021-11-04 05:40:07     C  4558
57  2021-11-04 05:56:51     C  7855
58  2021-11-04 06:27:28     C  7855
59  2021-11-04 07:50:46     C  7855

Desired result:

          Date Group  Repetitions
0   2021-10-29     A            1
1   2021-10-29     B            3
2   2021-10-29     C            3
3   2021-10-30     A            2
4   2021-10-30     B            2
5   2021-10-30     C            2
6   2021-10-31     A            4
7   2021-10-31     B            0
8   2021-10-31     C            1
9   2021-11-01     A            1
10  2021-11-01     B            2
11  2021-11-01     C            7
12  2021-11-02     A            1
13  2021-11-02     B            4
14  2021-11-02     C            5
15  2021-11-03     A            3
16  2021-11-03     B            1
17  2021-11-03     C            2
18  2021-11-04     A            0
19  2021-11-04     B            0
20  2021-11-04     C            4

Note that the duplicate conditions spans over dates and groups: the 'ID' 2699 in the example above counts as duplicate even if those repetitions belongs to different dates and groups.

IIUC, you can replace the duplicated values with 1, the others with zero, then groupby + agg + sum :

(df.assign(Date=pd.to_datetime(df['Date']).dt.normalize(),
           ID=df['ID'].duplicated(keep=False).astype(int)
          )
   .groupby(['Date', 'Group'], as_index=False).agg(repetitions=('ID', 'sum'))
)

output:

         Date Group  repetitions
0  2021-10-29     A            1
1  2021-10-29     B            3
2  2021-10-29     C            3
3  2021-10-30     A            2
4  2021-10-30     B            2
5  2021-10-30     C            2
6  2021-10-31     A            4
7  2021-10-31     C            1
8  2021-11-01     A            1
9  2021-11-01     B            2
10 2021-11-01     C            7
11 2021-11-02     A            1
12 2021-11-02     B            4
13 2021-11-02     C            5
14 2021-11-03     A            3
15 2021-11-03     B            1
16 2021-11-03     C            2
17 2021-11-04     A            0
18 2021-11-04     C            4

adding the missing combinations

dates = pd.to_datetime(df['Date']).dt.normalize()
idx = pd.MultiIndex.from_product([dates.unique(), df['Group'].unique()], names=['Date', 'Group'])

(df.assign(Date=dates,
           ID=df['ID'].duplicated(keep=False).astype(int)
          )
   .groupby(['Date', 'Group']).agg(repetitions=('ID', 'sum'))
   .reindex(idx, fill_value=0)
   .reset_index()
)

output:

         Date Group  repetitions
0  2021-10-29     B            3
1  2021-10-29     A            1
2  2021-10-29     C            3
3  2021-10-30     B            2
4  2021-10-30     A            2
5  2021-10-30     C            2
6  2021-10-31     B            0
7  2021-10-31     A            4
8  2021-10-31     C            1
9  2021-11-01     B            2
10 2021-11-01     A            1
11 2021-11-01     C            7
12 2021-11-02     B            4
13 2021-11-02     A            1
14 2021-11-02     C            5
15 2021-11-03     B            1
16 2021-11-03     A            3
17 2021-11-03     C            2
18 2021-11-04     B            0
19 2021-11-04     A            0
20 2021-11-04     C            4

df['ID'].value_counts()

应该给你每个 id 出现的次数。

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM