How to group by with a condition and then count in a Pandas DataFrame

Question

I have the following dataset:

import pandas as pd

df = pd.DataFrame(
    [
        ['John', 3, 'Yes'],
        ['John', 4, 'No'],
        ['Alex', 2, 'No'],
        ['Alex', 6, 'No'],
        ['John', 7, 'No'],
        ['John', 2, 'Yes'],
        ['Alex', 1, 'Yes']
    ], columns=['Name', 'TestType', 'Test'])

Giving me:

print(df):

Name        TestType          Test
John         3                 Yes
John         4                 No
Alex         2                 No
Alex         6                 No
John         7                 No
John         2                 Yes 
Alex         1                 Yes

The table is in chronological order. What I am trying to achieve, for each person, is a running count of tests where TestType is less than 5, and the running percentage of those tests that were actually taken (Test is Yes).

I am hoping for the output to be:

print (df):
Name        TestType          Test       TestsUnder5      TestPCunder5
John         3                 Yes            1              100%
John         4                 No             2              50%
Alex         2                 No             1              0%
Alex         6                 No             1              0%
John         7                 No             2              50%
John         2                 Yes            3              67%
Alex         1                 Yes            2              50%

I think I need to use groupby and cumsum, but I am not sure how to specify the condition and then perform the calculation. Any help would be much appreciated!

Answer 1

Almost there. You can apply math operators to a boolean Series, which coerces its values to the integers 0 or 1. For TestsUnder5, something like this should work:

df['TestsUnder5'] = (df.TestType < 5).groupby(df.Name).cumsum()
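To make the coercion step explicit, here is a minimal sketch that rebuilds the question's data and computes the running count on its own. The variable names under5 and tests_under5 are only illustrative, and the explicit astype(int) is an assumption added so the result is an integer count regardless of pandas version.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['John', 3, 'Yes'],
        ['John', 4, 'No'],
        ['Alex', 2, 'No'],
        ['Alex', 6, 'No'],
        ['John', 7, 'No'],
        ['John', 2, 'Yes'],
        ['Alex', 1, 'Yes'],
    ], columns=['Name', 'TestType', 'Test'])

# Boolean mask: True where the test type is below 5.
under5 = df['TestType'] < 5

# Turn True/False into 1/0, then take a running sum within each Name.
# Row order is preserved, so the result lines up with df's index.
tests_under5 = under5.astype(int).groupby(df['Name']).cumsum()

print(tests_under5.tolist())  # [1, 2, 1, 1, 2, 3, 2]
```

Rows with TestType of 5 or more contribute 0, so the count simply carries forward for that person.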

Similarly, for the percentage, you can combine the two conditions with a logical AND (&) to count only the tests under 5 that were actually taken:

df['TestPCunder5'] = (
    (
        # running count per person of tests under 5 that were taken
        ((df.Test == 'Yes') & (df.TestType < 5))
        .groupby(df.Name).cumsum()
    ) / df['TestsUnder5']
)

Your example results appear to be strings formatted as "{:.0%}". If that's what you're looking for, you can convert this column to strings:

df['TestPCunder5'] = df['TestPCunder5'].apply('{:.0%}'.format)
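Putting the ratio and the formatting together, the following is a self-contained sketch on the question's data. The intermediate names (under5, taken, ratio) are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['John', 3, 'Yes'],
        ['John', 4, 'No'],
        ['Alex', 2, 'No'],
        ['Alex', 6, 'No'],
        ['John', 7, 'No'],
        ['John', 2, 'Yes'],
        ['Alex', 1, 'Yes'],
    ], columns=['Name', 'TestType', 'Test'])

under5 = df['TestType'] < 5
taken = under5 & (df['Test'] == 'Yes')

# Running counts per person for both conditions.
df['TestsUnder5'] = under5.astype(int).groupby(df['Name']).cumsum()
ratio = taken.astype(int).groupby(df['Name']).cumsum() / df['TestsUnder5']

# Format each float as a whole-number percentage string.
df['TestPCunder5'] = ratio.apply('{:.0%}'.format)

print(df['TestPCunder5'].tolist())
# ['100%', '50%', '0%', '0%', '50%', '67%', '50%']
```

Once the column holds strings it can no longer be used in arithmetic, so it may be worth keeping the numeric ratio around if further calculations are needed.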

Answer 2

This is my approach:

newdf = (df.assign(TestUnder5=df.TestType.lt(5),
          TestTaken=df.TestType.lt(5) & df.Test.eq('Yes')
         )
   .groupby('Name')
   [['TestUnder5','TestTaken']]
   .cumsum()
)

# update original dataframe
df['TestUnder5'] = newdf['TestUnder5']
df['TestPCunder5'] = newdf['TestTaken']/newdf['TestUnder5']

Output:

   Name  TestType Test  TestUnder5  TestPCunder5
0  John         3  Yes         1.0      1.000000
1  John         4   No         2.0      0.500000
2  Alex         2   No         1.0      0.000000
3  Alex         6   No         1.0      0.000000
4  John         7   No         2.0      0.500000
5  John         2  Yes         3.0      0.666667
6  Alex         1  Yes         2.0      0.500000
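One edge case the sample data does not exercise: if a person's first rows all have TestType of 5 or more, the running count is 0 and the ratio becomes 0/0, which pandas reports as NaN. The sketch below uses a small made-up frame (not from the question) and assumes 0 is the desired percentage in that case; that choice is an assumption, not something the question specifies.

```python
import pandas as pd

# Hypothetical data: Alex's first test has TestType >= 5.
df2 = pd.DataFrame(
    [
        ['Alex', 6, 'No'],
        ['Alex', 1, 'Yes'],
    ], columns=['Name', 'TestType', 'Test'])

under5 = df2['TestType'].lt(5)
taken = under5 & df2['Test'].eq('Yes')

count = under5.astype(int).groupby(df2['Name']).cumsum()
raw = taken.astype(int).groupby(df2['Name']).cumsum() / count

# The first row is 0/0, which yields NaN; replace it with 0.0 (assumed default).
pct = raw.fillna(0.0)

print(pct.tolist())  # [0.0, 1.0]
```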
