Count of unique values by groupby in two columns

I want to determine the count of unique values based on two columns in a pandas DataFrame.

Below is an example:

import pandas as pd

d = ({
    'B' : ['08:00:00','John','08:10:00','Gary','08:41:42','John','08:50:00','John', '09:00:00', 'Gary','09:15:00','John','09:21:00','Gary','09:30:00','Gary','09:40:00','Gary'],
    'C' : ['1','1','1','1','1','1','2','2','2', '2','2','2','3','3','3', '3','3','3'],           
    'A' : ['Stop','','Res','','Start','','Stop','','Res','','Start','','Stop','','Res','','Start','']
    })

df = pd.DataFrame(data=d)

Output:

        A         B  C
0    Stop  08:00:00  1
1              John  1
2     Res  08:10:00  1
3              Gary  1
4   Start  08:41:42  1
5              John  1
6    Stop  08:50:00  2
7              John  2
8     Res  09:00:00  2
9              Gary  2
10  Start  09:15:00  2
11             John  2
12   Stop  09:21:00  3
13             Gary  3
14    Res  09:30:00  3
15             Gary  3
16  Start  09:40:00  3
17             Gary  3

If I perform the count based on columns A and C, I get the following:

k = df.groupby('A').C.nunique()

Res      3
Start    3
Stop     3

I'm hoping to split those up based on the people in column B. So the intended output would be:

John Stop  2
     Res   0 #Nan
     Start 2

Gary Stop  1
     Res   3 
     Start 1

I have tried k = df.groupby('A').BCnunique()

We can create a flattened DF (this assumes NumPy has been imported as np):

In [34]: d = pd.DataFrame(np.column_stack((df[['B', 'C', 'A']].iloc[::2], df[['B']].iloc[1::2])), columns=['time','id','op','name'])

In [35]: d
Out[35]:
       time id     op  name
0  08:00:00  1   Stop  John
1  08:10:00  1    Res  Gary
2  08:41:42  1  Start  John
3  08:50:00  2   Stop  John
4  09:00:00  2    Res  Gary
5  09:15:00  2  Start  John
6  09:21:00  3   Stop  Gary
7  09:30:00  3    Res  Gary
8  09:40:00  3  Start  Gary
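As a rough alternative to the column_stack call, the same flattening can be sketched with plain column selection, which avoids NumPy and does not depend on the column order of df. This assumes, as in the question's data, that even rows hold the event (time, id, op) and odd rows hold the person's name; the names events, names and flat are made up for the illustration.

```python
import pandas as pd

# Same data as in the question: times and names alternate in column B.
df = pd.DataFrame({
    'B': ['08:00:00', 'John', '08:10:00', 'Gary', '08:41:42', 'John',
          '08:50:00', 'John', '09:00:00', 'Gary', '09:15:00', 'John',
          '09:21:00', 'Gary', '09:30:00', 'Gary', '09:40:00', 'Gary'],
    'C': ['1'] * 6 + ['2'] * 6 + ['3'] * 6,
    'A': ['Stop', '', 'Res', '', 'Start', ''] * 3,
})

# Assumption: even rows are events, odd rows are the matching names.
events = df.iloc[::2].reset_index(drop=True)
names = df.iloc[1::2].reset_index(drop=True)

# Build the flat frame by column name, so the result does not depend
# on the order in which pandas stored the columns of df.
flat = pd.DataFrame({
    'time': events['B'],
    'id': events['C'],
    'op': events['A'],
    'name': names['B'],
})
print(flat)
```

If the rows do not strictly alternate, this pairing breaks, so it is worth checking that len(df) is even before relying on it.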

Prepare a multi-index that includes all combinations:

In [36]: idx = pd.MultiIndex.from_product((d.name.unique(), d.op.unique()))

and group by two columns:

In [39]: res = d.groupby(['name','op'])['id'].count().reindex(idx, fill_value=0)

In [40]: res
Out[40]:
John  Stop     2
      Res      0
      Start    2
Gary  Stop     1
      Res      3
      Start    1
Name: id, dtype: int64
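The answer uses count(), while the question asks for unique values. For this data the two agree, because no (name, op, id) combination repeats, but nunique() is the safer choice if duplicates can occur. A minimal self-contained sketch, with the flattened frame typed in directly (the variable names flat, by_count and by_nunique are illustrative only):

```python
import pandas as pd

# The flattened frame shown above, typed in directly so the sketch runs on its own.
flat = pd.DataFrame({
    'time': ['08:00:00', '08:10:00', '08:41:42', '08:50:00', '09:00:00',
             '09:15:00', '09:21:00', '09:30:00', '09:40:00'],
    'id': ['1', '1', '1', '2', '2', '2', '3', '3', '3'],
    'op': ['Stop', 'Res', 'Start'] * 3,
    'name': ['John', 'Gary', 'John', 'John', 'Gary', 'John',
             'Gary', 'Gary', 'Gary'],
})

# Every (name, op) pair, so pairs with no rows can be reported as 0.
idx = pd.MultiIndex.from_product(
    [flat['name'].unique(), flat['op'].unique()], names=['name', 'op']
)

grouped = flat.groupby(['name', 'op'])['id']

# count() tallies rows; nunique() tallies distinct ids within each group.
by_count = grouped.count().reindex(idx, fill_value=0)
by_nunique = grouped.nunique().reindex(idx, fill_value=0)
print(by_nunique)
```

The reindex step is what brings back the (John, Res) pair: groupby only emits groups that actually occur in the data.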

It's a strange dataframe; I would strongly advise against having times and names in the same column. Just add another column! This will make things easier.

Given your data, if you don't mind Res missing for John:

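# Treat empty strings as missing, then carry the last seen value down each column
df[df==''] = None
df = df.ffill()
# Keep only the name rows, then count unique C values per (B, A) pair
df[df['B'].isin(['Gary', 'John'])].groupby(['B', 'A']).C.nunique()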
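If the missing (John, Res) pair should appear as 0 after all, one possible variation is to pivot the result with unstack(fill_value=0). This is only a sketch under the same assumption as before (names sit on the odd rows); the names tidy, people, counts and table are made up for the illustration, and it selects the name rows by position rather than with isin.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'B': ['08:00:00', 'John', '08:10:00', 'Gary', '08:41:42', 'John',
          '08:50:00', 'John', '09:00:00', 'Gary', '09:15:00', 'John',
          '09:21:00', 'Gary', '09:30:00', 'Gary', '09:40:00', 'Gary'],
    'C': ['1'] * 6 + ['2'] * 6 + ['3'] * 6,
    'A': ['Stop', '', 'Res', '', 'Start', ''] * 3,
})

# Blank out the empty strings in A, then carry the previous op down
# so that each name row inherits the op of the event row above it.
tidy = df.copy()
tidy['A'] = tidy['A'].mask(tidy['A'] == '').ffill()

# Assumption: names sit on the odd rows of this layout.
people = tidy.iloc[1::2]

# Distinct C values per (person, op) pair; absent pairs are not listed yet.
counts = people.groupby(['B', 'A'])['C'].nunique()

# Pivot ops into columns; fill_value=0 supplies the absent (John, Res) cell.
table = counts.unstack(fill_value=0)
print(table)
```

The result is a wide table with one row per person and one column per op, which may be easier to read than the stacked Series.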
