Pandas groupby: count rows that satisfy a condition on another column?

Question

I would like to do a groupby in pandas that returns a DataFrame with these columns: the column used for grouping, the number of elements in each group, and, within each group, the number of elements that do and do not satisfy a condition based on another column's value.

For example, given input like this:

type    success
A       True
B       False
A       False
C       True

I would like something like:

type    total    numOfSuccess numOfFailure
A       2        1             1
B       1        0             1
C       1        1             0

In PySpark I did it like this:

import pyspark.sql.functions as F
df = df.groupBy("type").agg(\
    F.count('*').alias('total'), \
    F.sum(F.when(F.col('success')=="true", 1).otherwise(0)).alias('numOfSuccess'),
    F.sum(F.when(F.col('success')!="true", 1).otherwise(0)).alias('numOfFails'))

while in pandas I can only get the total and numOfSuccess with:

df_new = df.groupby(['type'], as_index=False)['success'].agg({'total':'count', 'numOfSuccess':'sum'})

or only the total with:

df = df.groupby(['type']).size().reset_index(name='NumOfReqs')

but I cannot get the third column, numOfFailures. Also, if there is an alternative to summing the boolean values, that would be better, since in my opinion it could be extended to other cases more easily.

How can I do that?

Answer 1

Use groupby with GroupBy.size to count all rows per group. The counts per category need pivoting, which can be done with GroupBy.size and unstack, with crosstab, or with pivot_table:

df1 = df.groupby('type').size().reset_index(name='count')
df2 = (df.groupby(['type', 'success']).size().unstack(fill_value=0)
        .rename(columns={True:'numOfSuccess', False:'numOfFails'}))

Alternative for df2:

df2 = (pd.crosstab(df['type'], df['success'])
        .rename(columns={True:'numOfSuccess', False:'numOfFails'}))

Or:

df2 = (df.pivot_table(index='type', columns='success', fill_value=0, aggfunc='size')
        .rename(columns={True:'numOfSuccess', False:'numOfFails'}))

df_new = df1.join(df2, on='type')
print (df_new)
  type  count  numOfFails  numOfSuccess
0    A      2           1             1
1    B      1           1             0
2    C      1           0             1
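
To try the snippets above end to end, here is a minimal self-contained sketch. It assumes the sample frame from the question, rebuilt by hand, and the helper names `names`, `by_unstack`, `by_crosstab`, and `same_counts` are introduced only for illustration. It checks that the unstack and crosstab variants give the same counts before joining them to the totals.

```python
import pandas as pd

# Sample frame rebuilt from the question's example input.
df = pd.DataFrame({
    "type": ["A", "B", "A", "C"],
    "success": [True, False, False, True],
})

# Mapping used to turn the True/False column labels into readable names.
names = {True: "numOfSuccess", False: "numOfFails"}

# Total number of rows per group.
df1 = df.groupby("type").size().reset_index(name="count")

# Per-group counts of each success value, pivoted into columns.
by_unstack = (df.groupby(["type", "success"]).size().unstack(fill_value=0)
                .rename(columns=names))
by_crosstab = pd.crosstab(df["type"], df["success"]).rename(columns=names)

# Both pivots should hold the same counts.
same_counts = by_unstack.values.tolist() == by_crosstab.values.tolist()

# Attach the pivoted counts to the totals, matching on the 'type' column.
df_new = df1.join(by_unstack, on="type")
print(same_counts)
print(df_new)
```

Since the pivoted frame is indexed by type, `join(..., on='type')` matches each row of the totals against that index.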

Another solution is to use the margins parameter of crosstab and remove the last row (the 'All' totals row) by indexing with iloc:

df = (pd.crosstab(df['type'], df['success'], margins=True)
        .rename(columns={True:'numOfSuccess', False:'numOfFails', 'All':'count'})
        .iloc[:-1]
        .reset_index()
        .rename_axis(None, axis=1))

print (df)
  type  numOfFails  numOfSuccess  count
0    A           1             1      2
1    B           1             0      1
2    C           0             1      1

EDIT: If True or False may be absent from the data, add reindex to create the missing column:

print (df)
  type  success
0    A     True
1    B     True
2    A     True
3    C     True

df1 = df.groupby('type').size().reset_index(name='count')
df2 = (df.groupby(['type', 'success']).size().unstack(fill_value=0)
         .reindex(columns=[True, False], fill_value=0)
         .rename(columns={True:'numOfSuccess', False:'numOfFails'}))


df_new = df1.join(df2, on='type')
print (df_new)
  type  count  numOfSuccess  numOfFails
0    A      2             2           0
1    B      1             1           0
2    C      1             1           0
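
The question also asks for something closer to the PySpark `F.when(...)` pattern, where each count comes from an explicit condition. The following is one possible sketch, not part of the original answer. It assumes pandas 0.25 or later for named aggregation, and the helper columns `is_success` and `is_failure` are hypothetical names introduced here. Each condition becomes a boolean column, and summing it per group counts the rows where it holds.

```python
import pandas as pd

# Sample frame rebuilt from the question's example input.
df = pd.DataFrame({
    "type": ["A", "B", "A", "C"],
    "success": [True, False, False, True],
})

summary = (
    # One boolean helper column per condition; any boolean mask would work here.
    df.assign(is_success=df["success"].eq(True),
              is_failure=df["success"].ne(True))
      .groupby("type")
      # Named aggregation: output_column=(input_column, aggregation).
      .agg(total=("success", "size"),
           numOfSuccess=("is_success", "sum"),
           numOfFailure=("is_failure", "sum"))
      .reset_index()
)
print(summary)
```

This still sums booleans internally, but the conditions are stated separately from the aggregation, so a different test (for example a numeric threshold on another column) only changes the `assign` step.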

Answer 1 (accepted, 2019-01-22 11:16:18)
