How to aggregate unique count with pandas pivot_table
This code:
df2 = (
    pd.DataFrame({
        'X' : ['X1', 'X1', 'X1', 'X1'],
        'Y' : ['Y2', 'Y1', 'Y1', 'Y1'],
        'Z' : ['Z3', 'Z1', 'Z1', 'Z2']
    })
)
g = df2.groupby('X')
pd.pivot_table(g, values='X', rows='Y', cols='Z', margins=False, aggfunc='count')
returns the following error:
Traceback (most recent call last): ...
AttributeError: 'Index' object has no attribute 'index'
How do I get a pivot table with counts of unique values of one DataFrame column for two other columns?
Is there an aggfunc for counting unique values? Should I be using np.bincount()?
NB. I am aware of pandas.Series.value_counts(), however I need a pivot table.
EDIT: The output should be:
Z Z1 Z2 Z3
Y
Y1 1 1 NaN
Y2 NaN NaN 1
Do you mean something like this?
>>> df2.pivot_table(values='X', index='Y', columns='Z', aggfunc=lambda x: len(x.unique()))
Z Z1 Z2 Z3
Y
Y1 1 1 NaN
Y2 NaN NaN 1
Note that using len assumes you don't have NAs in your DataFrame. Otherwise, you can use x.value_counts().count() or len(x.dropna().unique()).
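To make the NA caveat concrete, here is a minimal sketch on a standalone Series. The Series `x` is a hypothetical example, not data from the question; it only illustrates how the expressions mentioned above treat a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical values for one pivot cell, including one missing entry.
x = pd.Series(['X1', np.nan, 'X1', 'X2'])

with_na = len(x.unique())               # unique() keeps NaN as its own value
without_na = len(x.dropna().unique())   # drop NaN first, then count distinct values
via_value_counts = x.value_counts().count()  # value_counts() skips NaN by default
via_nunique = x.nunique()               # nunique() also skips NaN by default

print(with_na, without_na, via_value_counts, via_nunique)
```

The first expression counts NaN as a distinct value, while the other three ignore it.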
This is a good way of counting entries within .pivot_table:
>>> df2.pivot_table(values='X', index=['Y','Z'], columns='X', aggfunc='count')
X1 X2
Y Z
Y1 Z1 1 1
Z2 1 NaN
Y2 Z3 1 NaN
Since at least version 0.16 of pandas, pivot_table no longer accepts the "rows" parameter.
As of 0.23, the solution would be:
df2.pivot_table(values='X', index='Y', columns='Z', aggfunc=pd.Series.nunique)
which returns:
Z Z1 Z2 Z3
Y
Y1 1.0 1.0 NaN
Y2 NaN NaN 1.0
aggfunc=pd.Series.nunique provides a distinct count. The full code is as follows:
df2.pivot_table(values='X', index='Y', columns='Z', aggfunc=pd.Series.nunique)
Credit to @hume for this solution (see the comment under the accepted answer). Adding it as an answer here for better discoverability.
The aggfunc parameter in pandas.DataFrame.pivot_table will take 'nunique' as a string, or in a list.
Tested in pandas 1.3.1:
out = df2.pivot_table(values='X', index='Y', columns='Z', aggfunc=['nunique', 'count', lambda x: len(x.unique()), len])
[out]:
nunique count <lambda> len
Z Z1 Z2 Z3 Z1 Z2 Z3 Z1 Z2 Z3 Z1 Z2 Z3
Y
Y1 1.0 1.0 NaN 2.0 1.0 NaN 1.0 1.0 NaN 2.0 1.0 NaN
Y2 NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.0
out = df2.pivot_table(values='X', index='Y', columns='Z', aggfunc='nunique')
[out]:
Z Z1 Z2 Z3
Y
Y1 1.0 1.0 NaN
Y2 NaN NaN 1.0
out = df2.pivot_table(values='X', index='Y', columns='Z', aggfunc=['nunique'])
[out]:
nunique
Z Z1 Z2 Z3
Y
Y1 1.0 1.0 NaN
Y2 NaN NaN 1.0
You can construct a pivot table for each distinct value of X. In this case,
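import numpy

for xval, xgroup in g:
    # One pivot table per distinct value of X; numpy.size counts rows per (Y, Z) cell.
    ptable = pd.pivot_table(xgroup, index='Y', columns='Z',
                            margins=False, aggfunc=numpy.size)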
will construct a pivot table for each value of X. You may want to index ptable using xval. With this code, I get (for X1)
X
Z Z1 Z2 Z3
Y
Y1 2 1 NaN
Y2 NaN NaN 1
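One way to "index ptable" by the group key is to store each table in a dict. The following is a minimal sketch under some assumptions: the dict name `ptables` is hypothetical, the current `index`/`columns` keyword names are used, and `aggfunc='count'` is swapped in for `numpy.size` (the two give the same numbers here because X has no missing values).

```python
import pandas as pd

df2 = pd.DataFrame({
    'X': ['X1', 'X1', 'X1', 'X1'],
    'Y': ['Y2', 'Y1', 'Y1', 'Y1'],
    'Z': ['Z3', 'Z1', 'Z1', 'Z2'],
})

# Hypothetical container: one pivot table per distinct value of X, keyed by that value.
ptables = {}
for xval, xgroup in df2.groupby('X'):
    # Count the rows falling into each (Y, Z) cell within this X group.
    ptables[xval] = xgroup.pivot_table(values='X', index='Y', columns='Z',
                                       aggfunc='count')

print(ptables['X1'])
```

Looking up `ptables['X1']` then returns the table for that value of X. Note that these are row counts per cell, not unique counts.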
Since none of the answers are up to date with the latest version of pandas, I am writing another solution for this problem:
import pandas as pd
# Set example
df2 = (
    pd.DataFrame({
        'X' : ['X1', 'X1', 'X1', 'X1'],
        'Y' : ['Y2', 'Y1', 'Y1', 'Y1'],
        'Z' : ['Z3', 'Z1', 'Z1', 'Z2']
    })
)
# Pivot
pd.crosstab(index=df2['Y'], columns=df2['Z'], values=df2['X'], aggfunc=pd.Series.nunique)
which returns:
Z Z1 Z2 Z3
Y
Y1 1.0 1.0 NaN
Y2 NaN NaN 1.0
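As a quick sanity check, the crosstab call can be compared with the pivot_table call from the other answers. This is only a sketch on the example data; the names `ct` and `pt` are introduced here for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({
    'X': ['X1', 'X1', 'X1', 'X1'],
    'Y': ['Y2', 'Y1', 'Y1', 'Y1'],
    'Z': ['Z3', 'Z1', 'Z1', 'Z2'],
})

# Unique count of X per (Y, Z) cell, computed two ways.
ct = pd.crosstab(index=df2['Y'], columns=df2['Z'], values=df2['X'],
                 aggfunc=pd.Series.nunique)
pt = df2.pivot_table(values='X', index='Y', columns='Z',
                     aggfunc=pd.Series.nunique)

# Compare cell by cell, treating empty cells (NaN) as 0 for the comparison only.
same = ct.fillna(0).to_numpy().tolist() == pt.fillna(0).to_numpy().tolist()
print(same)
```

On this data both calls fill the same cells with the same unique counts.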
For best performance I recommend calling DataFrame.drop_duplicates first, followed by aggfunc='count'.
Others are correct that aggfunc=pd.Series.nunique will work. This can be slow, however, if the number of index groups you have is large (>1000).
So instead of (to quote @Javier)
df2.pivot_table('X', 'Y', 'Z', aggfunc=pd.Series.nunique)
I suggest
df2.drop_duplicates(['X', 'Y', 'Z']).pivot_table('X', 'Y', 'Z', aggfunc='count')
This works because it guarantees that every subgroup (each combination of ('Y', 'Z')) will have unique (non-duplicate) values of 'X'.
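The equivalence of the two approaches can be checked on synthetic data. The sketch below is an assumption-laden illustration: the random frame `df`, its size, and the names `slow` and `fast` are all hypothetical, and it only verifies that the results agree. It makes no timing claim; wrap either call in timeit on your own data to measure that.

```python
import numpy as np
import pandas as pd

# Hypothetical data: integer codes standing in for X, Y and Z, with no missing values.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    'X': rng.integers(0, 50, size=n),
    'Y': rng.integers(0, 200, size=n),
    'Z': rng.integers(0, 5, size=n),
})

# Unique count of X per (Y, Z) cell via nunique.
slow = df.pivot_table(values='X', index='Y', columns='Z', aggfunc=pd.Series.nunique)

# Same result: remove duplicate (X, Y, Z) rows first, then simply count what is left.
fast = (df.drop_duplicates(['X', 'Y', 'Z'])
          .pivot_table(values='X', index='Y', columns='Z', aggfunc='count'))

# Normalise empty cells and dtypes before comparing the two tables.
agree = slow.fillna(0).astype(int).equals(fast.fillna(0).astype(int))
print(agree)
```

This relies on X having no missing values in the example; `count` and `nunique` both skip NaN, so rows with a missing X contribute to neither result.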
aggfunc=pd.Series.nunique will only count unique values for a series - in this case, the unique values of a column. It is therefore not a drop-in alternative to aggfunc='count'.
For simple counting, it is better to use aggfunc=pd.Series.count.
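The difference between the two aggregations shows up on the question's own data, where the (Y1, Z1) cell contains two rows with the same value of X. A minimal sketch (the names `counts` and `uniques` are introduced here for illustration):

```python
import pandas as pd

df2 = pd.DataFrame({
    'X': ['X1', 'X1', 'X1', 'X1'],
    'Y': ['Y2', 'Y1', 'Y1', 'Y1'],
    'Z': ['Z3', 'Z1', 'Z1', 'Z2'],
})

# Number of non-missing X entries per (Y, Z) cell.
counts = df2.pivot_table(values='X', index='Y', columns='Z', aggfunc='count')
# Number of distinct X values per (Y, Z) cell.
uniques = df2.pivot_table(values='X', index='Y', columns='Z', aggfunc='nunique')

print(counts.loc['Y1', 'Z1'], uniques.loc['Y1', 'Z1'])
```

For (Y1, Z1) the count is 2 while the unique count is 1, so which aggfunc to use depends on whether repeated values should be counted.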