Count of unique values by groupby in two columns
I want to determine the count of unique values based on two columns in a pandas df.
Below is an example:
import pandas as pd
d = ({
'B' : ['08:00:00','John','08:10:00','Gary','08:41:42','John','08:50:00','John', '09:00:00', 'Gary','09:15:00','John','09:21:00','Gary','09:30:00','Gary','09:40:00','Gary'],
'C' : ['1','1','1','1','1','1','2','2','2', '2','2','2','3','3','3', '3','3','3'],
'A' : ['Stop','','Res','','Start','','Stop','','Res','','Start','','Stop','','Res','','Start','']
})
df = pd.DataFrame(data=d)
Output:
        A         B  C
0    Stop  08:00:00  1
1              John  1
2     Res  08:10:00  1
3              Gary  1
4   Start  08:41:42  1
5              John  1
6    Stop  08:50:00  2
7              John  2
8     Res  09:00:00  2
9              Gary  2
10  Start  09:15:00  2
11             John  2
12   Stop  09:21:00  3
13             Gary  3
14    Res  09:30:00  3
15             Gary  3
16  Start  09:40:00  3
17             Gary  3
If I perform the count based on columns A and C, I get the following:
k = df.groupby('A').C.nunique()
Res 3
Start 3
Stop 3
I'm hoping to split those up based on the people in column B. So the intended output would be:
John  Stop   2
      Res    0  # NaN
      Start  2
Gary  Stop   1
      Res    3
      Start  1
I have tried:
k = df.groupby('A').BCnunique()
We can create a flattened DataFrame, with one row per event:
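In [33]: import numpy as np
# Columns are selected by name so the result does not depend on the column order of df.
In [34]: d = pd.DataFrame(np.column_stack((df.iloc[::2][['B','C','A']], df.iloc[1::2][['B']])), columns=['time','id','op','name'])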
In [35]: d
Out[35]:
       time  id     op  name
0  08:00:00   1   Stop  John
1  08:10:00   1    Res  Gary
2  08:41:42   1  Start  John
3  08:50:00   2   Stop  John
4  09:00:00   2    Res  Gary
5  09:15:00   2  Start  John
6  09:21:00   3   Stop  Gary
7  09:30:00   3    Res  Gary
8  09:40:00   3  Start  Gary
Prepare a MultiIndex that includes all name/operation combinations:
In [36]: idx = pd.MultiIndex.from_product((d.name.unique(), d.op.unique()))
Then group by the two columns and reindex, so that missing combinations are filled with 0:
In [39]: res = d.groupby(['name','op'])['id'].count().reindex(idx, fill_value=0)
In [40]: res
Out[40]:
John  Stop     2
      Res      0
      Start    2
Gary  Stop     1
      Res      3
      Start    1
Name: id, dtype: int64
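The steps above can also be put together in one self-contained script. This is only a sketch: it rebuilds the question's data, pairs each operation row with the name row below it without NumPy, and uses nunique() instead of count(), since the question asks for unique values (both give the same numbers for this data). The names ops, names and flat are made up for illustration.

```python
import pandas as pd

# Same data as in the question: operations sit on even rows, names on odd rows.
df = pd.DataFrame({
    'B': ['08:00:00', 'John', '08:10:00', 'Gary', '08:41:42', 'John',
          '08:50:00', 'John', '09:00:00', 'Gary', '09:15:00', 'John',
          '09:21:00', 'Gary', '09:30:00', 'Gary', '09:40:00', 'Gary'],
    'C': ['1'] * 6 + ['2'] * 6 + ['3'] * 6,
    'A': ['Stop', '', 'Res', '', 'Start', ''] * 3,
})

ops = df.iloc[::2].reset_index(drop=True)     # rows holding time + operation
names = df.iloc[1::2].reset_index(drop=True)  # rows holding the person's name

# One row per event (hypothetical column names, matching the answer above).
flat = pd.DataFrame({
    'time': ops['B'],
    'id': ops['C'],
    'op': ops['A'],
    'name': names['B'],
})

# Every (name, op) pair, so that pairs absent from the data show up as 0.
idx = pd.MultiIndex.from_product(
    [flat['name'].unique(), flat['op'].unique()], names=['name', 'op']
)

# Count distinct ids per (name, op) and fill the missing pairs with 0.
res = flat.groupby(['name', 'op'])['id'].nunique().reindex(idx, fill_value=0)
print(res)
```

The reindex step is what produces the explicit 0 for John/Res; without it, that combination would simply be absent from the result.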
It's a strange DataFrame; I would strongly advise against having times and names in the same column. Just add another column! This will make things easier.
Given your data, if you don't mind Res missing from John:
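df = df.mask(df == '')  # treat empty strings as missing values
df = df.ffill()  # copy each operation down to the name row below it (fillna(method='ffill') is deprecated in recent pandas)
df[df['B'].isin(['Gary', 'John'])].groupby(['B', 'A']).C.nunique()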
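If the missing John/Res combination should appear as 0 after all, one option is to unstack the grouped result with a fill value. The following is a minimal sketch under that assumption, run on the question's data; the names tidy, people, counts and table are made up for illustration.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'B': ['08:00:00', 'John', '08:10:00', 'Gary', '08:41:42', 'John',
          '08:50:00', 'John', '09:00:00', 'Gary', '09:15:00', 'John',
          '09:21:00', 'Gary', '09:30:00', 'Gary', '09:40:00', 'Gary'],
    'C': ['1'] * 6 + ['2'] * 6 + ['3'] * 6,
    'A': ['Stop', '', 'Res', '', 'Start', ''] * 3,
})

# Turn empty strings into NaN, then forward-fill so every name row
# carries the operation from the row above it.
tidy = df.mask(df == '').ffill()

# Keep only the rows whose B value is a person's name.
people = tidy[tidy['B'].isin(['Gary', 'John'])]

# Distinct C values per (person, operation).
counts = people.groupby(['B', 'A'])['C'].nunique()

# Pivot operations into columns; absent combinations become 0.
table = counts.unstack(fill_value=0)
print(table)
```

Here table has one row per person and one column per operation, so table.loc['John', 'Res'] is 0 rather than missing.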