Pandas groupby: counting rows satisfying a condition on other columns?
I would like to do a groupby in pandas that returns a DataFrame whose columns are the column used for grouping, the number of elements in each group, and, within each group, the number of elements that do and do not satisfy a condition based on another column's value.
For example, given an input like this:
type success
A True
B False
A False
C True
I would like something like:
type total numOfSuccess numOfFailure
A 2 1 1
B 1 0 1
C 1 1 0
In PySpark I did it like this:
import pyspark.sql.functions as F

df = df.groupBy("type").agg(
    F.count('*').alias('total'),
    F.sum(F.when(F.col('success') == "true", 1).otherwise(0)).alias('numOfSuccess'),
    F.sum(F.when(F.col('success') != "true", 1).otherwise(0)).alias('numOfFails'))
while in pandas I can only get total and numOfSuccess as:
df_new = df.groupby(['type'], as_index=False)['success'].agg({'total':'count', 'numOfSuccess':'sum'})
or only the total as:
df = df.groupby(['type']).size().reset_index(name='NumOfReqs')
but I cannot get the third column, numOfFailures. Also, if there is an alternative to summing the boolean values, that would be better, since in my opinion it could be extended to other cases more easily.
How can I do that?
Use groupby with GroupBy.size to count all rows. Counting per category then needs pivoting, which can be done with GroupBy.size and unstack, with crosstab, or with pivot_table:
df1 = df.groupby('type').size().reset_index(name='count')
df2 = (df.groupby(['type', 'success']).size().unstack(fill_value=0)
.rename(columns={True:'numOfSuccess', False:'numOfFails'}))
Alternative for df2:
df2 = (pd.crosstab(df['type'], df['success'])
         .rename(columns={True:'numOfSuccess', False:'numOfFails'}))
Or:
df2 = (df.pivot_table(index='type', columns='success', fill_value=0, aggfunc='size')
.rename(columns={True:'numOfSuccess', False:'numOfFails'}))
df_new = df1.join(df2, on='type')
print (df_new)
type count numOfFails numOfSuccess
0 A 2 1 1
1 B 1 1 0
2 C 1 0 1
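On the request for something more general than pivoting on a boolean column, one possibility is pandas named aggregation (available since pandas 0.25), where each output column gets its own function. This is a minimal sketch and not part of the answer above: the sample frame is rebuilt from the question, and the name `summary` is only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "type": ["A", "B", "A", "C"],
    "success": [True, False, False, True],
})

summary = (
    df.groupby("type")
      .agg(
          # "count" counts non-null values of the column in each group.
          total=("success", "count"),
          # True counts as 1, so the sum is the number of successes.
          numOfSuccess=("success", "sum"),
          # Any condition can go in the lambda; here: rows that are not True.
          numOfFailure=("success", lambda s: (~s).sum()),
      )
      .reset_index()
)
print(summary)
```

The lambda can be swapped for a condition on a non-boolean column, for example `lambda s: (s > 10).sum()` on a numeric column, which is what makes this form easier to extend.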
Another solution is to use the margins parameter in crosstab and remove the last row by indexing with iloc:
df = (pd.crosstab(df['type'], df['success'], margins=True)
.rename(columns={True:'numOfSuccess', False:'numOfFails', 'All':'count'})
.iloc[:-1]
.reset_index()
.rename_axis(None, axis=1))
print (df)
type numOfFails numOfSuccess count
0 A 1 1 2
1 B 1 0 1
2 C 0 1 1
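A small variation on the margins approach, sketched here as an assumption rather than as part of the original answer: crosstab also accepts margins_name, so the totals can be named count directly and the extra totals row dropped by label instead of by position. The sample frame is rebuilt from the question and the name `out` is illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "type": ["A", "B", "A", "C"],
    "success": [True, False, False, True],
})

out = (
    pd.crosstab(df["type"], df["success"], margins=True, margins_name="count")
      .drop(index="count")  # remove the totals row, keep the totals column
      .rename(columns={True: "numOfSuccess", False: "numOfFails"})
      .reset_index()
      .rename_axis(None, axis=1)
)
print(out)
```

Dropping by label avoids relying on the totals row being last, which is the only thing `.iloc[:-1]` checks.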
EDIT: If it is possible that True or False does not occur in the data, add reindex to add the missing column:
print (df)
type success
0 A True
1 B True
2 A True
3 C True
df1 = df.groupby('type').size().reset_index(name='count')
df2 = (df.groupby(['type', 'success']).size().unstack(fill_value=0)
.reindex(columns=[True, False], fill_value=0)
.rename(columns={True:'numOfSuccess', False:'numOfFails'}))
df_new = df1.join(df2, on='type')
print (df_new)
type count numOfSuccess numOfFails
0 A 2 2 0
1 B 1 1 0
2 C 1 1 0
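If the missing-column case is a concern, another option is to build one indicator column per outcome before grouping, so both columns exist even when one outcome never occurs. This is a hedged sketch under that assumption, not the answer's method. The names `flags` and `result` are illustrative, and the data is the all-True frame shown above.

```python
import pandas as pd

# All-True sample, as in the EDIT example.
df = pd.DataFrame({
    "type": ["A", "B", "A", "C"],
    "success": [True, True, True, True],
})

# One 0/1 indicator column per outcome; both always exist.
flags = df.assign(
    numOfSuccess=df["success"].astype(int),
    numOfFails=(~df["success"]).astype(int),
)

result = (
    flags.groupby("type")
         .agg(
             count=("success", "count"),  # non-null rows per group
             numOfSuccess=("numOfSuccess", "sum"),
             numOfFails=("numOfFails", "sum"),
         )
         .reset_index()
)
print(result)
```

Because the indicators are ordinary columns, the same pattern works for any condition, such as a comparison on a numeric column.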