
Count of unique values by groupby in two columns

I want to count the unique values in one column, grouped by two other columns, in a pandas DataFrame.

Below is an example:

import pandas as pd

d = ({
    'B' : ['08:00:00','John','08:10:00','Gary','08:41:42','John','08:50:00','John', '09:00:00', 'Gary','09:15:00','John','09:21:00','Gary','09:30:00','Gary','09:40:00','Gary'],
    'C' : ['1','1','1','1','1','1','2','2','2', '2','2','2','3','3','3', '3','3','3'],           
    'A' : ['Stop','','Res','','Start','','Stop','','Res','','Start','','Stop','','Res','','Start','']
    })

df = pd.DataFrame(data=d)

Output:

        A         B  C
0    Stop  08:00:00  1
1              John  1
2     Res  08:10:00  1
3              Gary  1
4   Start  08:41:42  1
5              John  1
6    Stop  08:50:00  2
7              John  2
8     Res  09:00:00  2
9              Gary  2
10  Start  09:15:00  2
11             John  2
12   Stop  09:21:00  3
13             Gary  3
14    Res  09:30:00  3
15             Gary  3
16  Start  09:40:00  3
17             Gary  3

If I count the unique values of column C grouped by column A, I get the following:

k = df.groupby('A').C.nunique()

Res      3
Start    3
Stop     3

I'm hoping to split those counts up by the people in column B, so the intended output would be:

John Stop  2
     Res   0 #Nan
     Start 2

Gary Stop  1
     Res   3 
     Start 1

I have tried k = df.groupby('A').BCnunique()

We can create a flattened DataFrame (this needs NumPy imported as np):

In [33]: import numpy as np

In [34]: d = pd.DataFrame(np.column_stack((df.iloc[::2], df.iloc[1::2, [0]])), columns=['time','id','op','name'])

In [35]: d
Out[35]:
       time id     op  name
0  08:00:00  1   Stop  John
1  08:10:00  1    Res  Gary
2  08:41:42  1  Start  John
3  08:50:00  2   Stop  John
4  09:00:00  2    Res  Gary
5  09:15:00  2  Start  John
6  09:21:00  3   Stop  Gary
7  09:30:00  3    Res  Gary
8  09:40:00  3  Start  Gary
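The same flattening can also be sketched with pandas alone, selecting columns by name rather than by position. This is only an illustration on the first six rows of the question's data; the variable names raw, events, names and flat are made up for this example, and it assumes the rows strictly alternate between an event row and a name row.

```python
import pandas as pd

# First six rows of the question's data (one full Stop/Res/Start cycle).
raw = pd.DataFrame({
    'B': ['08:00:00', 'John', '08:10:00', 'Gary', '08:41:42', 'John'],
    'C': ['1', '1', '1', '1', '1', '1'],
    'A': ['Stop', '', 'Res', '', 'Start', ''],
})

# Even rows hold the event (time, id, op); odd rows hold the person's name.
events = raw.iloc[::2].reset_index(drop=True)
names = raw.iloc[1::2].reset_index(drop=True)

# Build one row per event, with the name taken from the row that follows it.
flat = pd.DataFrame({
    'time': events['B'],
    'id': events['C'],
    'op': events['A'],
    'name': names['B'],
})
print(flat)
```

Selecting by column name means the result does not depend on the order in which the DataFrame happens to store A, B and C.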

Prepare a MultiIndex that includes all name/op combinations:

In [36]: idx = pd.MultiIndex.from_product((d.name.unique(), d.op.unique()))

Then group by the two columns and reindex, so that missing combinations are filled with 0:

In [39]: res = d.groupby(['name','op'])['id'].count().reindex(idx, fill_value=0)

In [40]: res
Out[40]:
John  Stop     2
      Res      0
      Start    2
Gary  Stop     1
      Res      3
      Start    1
Name: id, dtype: int64
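Note that count() counts rows, while the question asks for unique values. With this data the two agree, because no name/op pair repeats an id. A minimal sketch of the same steps using nunique() instead, assuming the flattened frame looks like the one above (the variable names flat and unique_ids are made up for this example):

```python
import pandas as pd

# Flattened data as in the answer: one row per event.
flat = pd.DataFrame({
    'id':   ['1', '1', '1', '2', '2', '2', '3', '3', '3'],
    'op':   ['Stop', 'Res', 'Start'] * 3,
    'name': ['John', 'Gary', 'John', 'John', 'Gary', 'John',
             'Gary', 'Gary', 'Gary'],
})

# Every (name, op) pair, in order of first appearance.
idx = pd.MultiIndex.from_product(
    [flat['name'].unique(), flat['op'].unique()], names=['name', 'op']
)

# Count distinct ids per pair; pairs that never occur get 0.
unique_ids = (
    flat.groupby(['name', 'op'])['id']
        .nunique()
        .reindex(idx, fill_value=0)
)
print(unique_ids)
```

If the same id could appear more than once for a given name and op, nunique() would count it once, whereas count() would count every row.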

It's a strange DataFrame. I would strongly advise against having times and names in the same column. Just add another column! This will make things easier.

Given your data, if you don't mind Res missing for John:

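df[df == ''] = None   # treat empty strings as missing
df = df.ffill()       # carry each op down to the name row below it
# keep only the name rows, then count unique C values per (name, op)
df[df['B'].isin(['Gary', 'John'])].groupby(['B', 'A']).C.nunique()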
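If the missing John/Res combination should appear as 0, one possible extension is to unstack the result with a fill value. The sketch below is self-contained and rebuilds the question's data. It assumes, as the answer does, that the names are known in advance; the variable names ops, is_name, named and counts are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'B': ['08:00:00', 'John', '08:10:00', 'Gary', '08:41:42', 'John',
          '08:50:00', 'John', '09:00:00', 'Gary', '09:15:00', 'John',
          '09:21:00', 'Gary', '09:30:00', 'Gary', '09:40:00', 'Gary'],
    'C': ['1'] * 6 + ['2'] * 6 + ['3'] * 6,
    'A': ['Stop', '', 'Res', '', 'Start', ''] * 3,
})

# Blank out the empty strings in A, then carry each op down to the name row.
ops = df['A'].mask(df['A'] == '').ffill()

# Keep only the name rows, replacing their empty A with the filled op.
is_name = df['B'].isin(['John', 'Gary'])
named = df.loc[is_name].assign(A=ops[is_name])

# Unique C values per (name, op); unstack so absent pairs show up as 0.
counts = named.groupby(['B', 'A'])['C'].nunique().unstack(fill_value=0)
print(counts)
```

The result is a table with one row per name and one column per op, which can be turned back into a Series with counts.stack() if the long layout is preferred.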
