Pandas groupby: counting rows satisfying a condition on other columns?

I would like to do a groupby in pandas that returns a dataframe with the following columns: the column used for grouping, the number of elements in each group, and the number of elements in each group that do and do not satisfy a condition based on another column's value.

For example, given this input:

type    success
A       True
B       False
A       False
C       True

I would like something like:

type    total    numOfSuccess numOfFailure
A       2        1             1
B       1        0             1
C       1        1             0

In PySpark I did it like this:

import pyspark.sql.functions as F
df = df.groupBy("type").agg(
    F.count('*').alias('total'),
    F.sum(F.when(F.col('success')=="true", 1).otherwise(0)).alias('numOfSuccess'),
    F.sum(F.when(F.col('success')!="true", 1).otherwise(0)).alias('numOfFails'))

while in pandas I can only get total and numOfSuccess:

df_new = df.groupby(['type'], as_index=False)['success'].agg({'total':'count', 'numOfSuccess':'sum'})

or only the total:

df = df.groupby(['type']).size().reset_index(name='NumOfReqs')

but I cannot get the third column, numOfFailures. Also, an alternative to summing the boolean values would be better, since in my opinion it could be extended to other cases more easily.

How can I do that?

Use groupby with GroupBy.size to count all rows, then pivot to get the counts per category, using GroupBy.size with unstack, crosstab, or pivot_table:

df1 = df.groupby('type').size().reset_index(name='count')
df2 = (df.groupby(['type', 'success']).size().unstack(fill_value=0)
        .rename(columns={True:'numOfSuccess', False:'numOfFails'}))

Alternative for df2:

df2 = (pd.crosstab(df['type'], df['success'])
        .rename(columns={True:'numOfSuccess', False:'numOfFails'}))

Or:

df2 = (df.pivot_table(index='type', columns='success', fill_value=0, aggfunc='size')
        .rename(columns={True:'numOfSuccess', False:'numOfFails'}))

df_new = df1.join(df2, on='type')
print (df_new)
  type  count  numOfFails  numOfSuccess
0    A      2           1             1
1    B      1           1             0
2    C      1           0             1
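
As a possible alternative to the pivoting approaches above, all three columns can also be built in a single groupby with named aggregation (available in pandas 0.25 and later). This is only a sketch: the helper columns is_success and is_failure are hypothetical names introduced for illustration, and the sample frame reproduces the input from the question. Any other boolean condition can be substituted for the two helper columns.

```python
import pandas as pd

# Sample input from the question
df = pd.DataFrame({
    "type": ["A", "B", "A", "C"],
    "success": [True, False, False, True],
})

out = (
    df.assign(
        is_success=df["success"],    # hypothetical helper column: condition holds
        is_failure=~df["success"],   # hypothetical helper column: condition does not hold
    )
    .groupby("type", as_index=False)
    .agg(
        total=("success", "count"),          # non-null rows per group
        numOfSuccess=("is_success", "sum"),  # True values sum as 1
        numOfFailure=("is_failure", "sum"),
    )
)
print(out)
```

Summing a boolean column counts its True values, so each helper column yields one count per group.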

Another solution is to use the margins parameter in crosstab and remove the last row by indexing with iloc:

df = (pd.crosstab(df['type'], df['success'], margins=True)
        .rename(columns={True:'numOfSuccess', False:'numOfFails', 'All':'count'})
        .iloc[:-1]
        .reset_index()
        .rename_axis(None, axis=1))

print (df)
  type  numOfFails  numOfSuccess  count
0    A           1             1      2
1    B           1             0      1
2    C           0             1      1

EDIT: If True or False might not be present in the data, add reindex to add the missing column:

print (df)
  type  success
0    A     True
1    B     True
2    A     True
3    C     True

df1 = df.groupby('type').size().reset_index(name='count')
df2 = (df.groupby(['type', 'success']).size().unstack(fill_value=0)
         .reindex(columns=[True, False], fill_value=0)
         .rename(columns={True:'numOfSuccess', False:'numOfFails'}))


df_new = df1.join(df2, on='type')
print (df_new)
  type  count  numOfSuccess  numOfFails
0    A      2             2           0
1    B      1             1           0
2    C      1             1           0
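
The same reindex step should also work with crosstab. The following is a self-contained sketch rather than a definitive version: it rebuilds the all-True input shown above and derives the count column from the row sums instead of a separate groupby, which assumes that success contains no missing values.

```python
import pandas as pd

# All-True input, as in the EDIT example above
df = pd.DataFrame({
    "type": ["A", "B", "A", "C"],
    "success": [True, True, True, True],
})

counts = (
    pd.crosstab(df["type"], df["success"])
    .reindex(columns=[True, False], fill_value=0)  # add the missing False column
    .rename(columns={True: "numOfSuccess", False: "numOfFails"})
)
# Row sums give the group sizes (assumes no missing values in success)
counts.insert(0, "count", counts.sum(axis=1))

df_new = counts.reset_index().rename_axis(None, axis=1)
print(df_new)
```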
