How to use groupby with conditions and then cumcount in a Pandas DataFrame

I have the following dataset:

import pandas as pd

df = pd.DataFrame(
    [
        ['John', 3, 'Yes'],
        ['John', 4, 'No'],
        ['Alex', 2, 'No'],
        ['Alex', 6, 'No'],
        ['John', 7, 'No'],
        ['John', 2, 'Yes'],
        ['Alex', 1, 'Yes']
    ], columns = ['Name', 'TestType','Test'])

Giving me:

print(df)

Name        TestType          Test
John         3                 Yes
John         4                 No
Alex         2                 No
Alex         6                 No
John         7                 No
John         2                 Yes 
Alex         1                 Yes 

The table is in chronological order. What I am trying to achieve is, for each person, a running count of the tests taken where TestType is less than 5, and the running percentage of those tests where Test is 'Yes'.

I am hoping for the output to be:

print(df)
Name        TestType          Test       TestsUnder5      TestPCunder5
John         3                 Yes            1              100%
John         4                 No             2              50%
Alex         2                 No             1              0%
Alex         6                 No             1              0%
John         7                 No             2              50%
John         2                 Yes            3              67%
Alex         1                 Yes            2              50%

I think I need to use groupby and cumsum, but I am not sure how to specify the condition and then perform the calculation. Any help would be much appreciated!

You are almost there. You can apply math operators to a boolean Series, which coerces its values to the integers 0 and 1. For TestsUnder5, something like this should work:

df['TestsUnder5'] = (df.TestType < 5).groupby(df.Name).cumsum()
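
To make the coercion explicit, here is a minimal sketch on a small toy sample (the names `names`, `test_type`, `mask` and `running` are illustrative only and not part of the question). It casts the mask to integers with `astype(int)` before the grouped cumulative sum, so the result is an integer count.

```python
import pandas as pd

# Toy data: the first five rows of the question's table (illustrative only)
names = pd.Series(['John', 'John', 'Alex', 'Alex', 'John'])
test_type = pd.Series([3, 4, 2, 6, 7])

mask = test_type < 5          # boolean Series: True where the condition holds
print(mask.sum())             # True counts as 1 and False as 0 when summed

# Running count of True values within each person's rows
running = mask.astype(int).groupby(names).cumsum()
print(running.tolist())
```

Rows that fail the condition add 0, so the running count simply carries forward on those rows.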

Similarly, for the percentage, you can combine the two conditions with a logical AND (&) to count the tests under 5 where Test is 'Yes', then divide by TestsUnder5:

df['TestPCunder5'] = (
    (
        ((df.Test == 'Yes') & (df.TestType < 5))
        .groupby(df.Name).cumsum()
    ) / df['TestsUnder5']
)

Your example results appear to be strings formatted with "{:.0%}". If that is what you are looking for, you can convert this column to strings:

df['TestPCunder5'] = df['TestPCunder5'].apply('{:.0%}'.format)
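
One edge case is worth keeping in mind: if a person's first rows all have TestType of 5 or more, TestsUnder5 is still 0 and the ratio is 0/0, which pandas returns as NaN. The sketch below uses a hypothetical helper, `format_pct` (not part of pandas or of the answer above), that leaves those entries as empty strings; whether an empty string is the right placeholder is an assumption.

```python
import pandas as pd

def format_pct(ratios):
    """Hypothetical helper: format ratios as whole percentages, NaN as ''."""
    return ratios.map(lambda r: '' if pd.isna(r) else '{:.0%}'.format(r))

# Example ratios, including a NaN as produced by 0 / 0
ratios = pd.Series([1.0, 0.5, 0.0, 2 / 3, float('nan')])
labels = format_pct(ratios)
print(labels.tolist())  # ['100%', '50%', '0%', '67%', '']
```

Note that the formatted column holds strings, so it is usually easier to keep the numeric ratio for further calculations and format only for display.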

This is my approach:

newdf = (df.assign(TestUnder5=df.TestType.lt(5),
          TestTaken=df.TestType.lt(5) & df.Test.eq('Yes')
         )
   .groupby('Name')
   [['TestUnder5','TestTaken']]
   .cumsum()
)

# update original dataframe
df['TestUnder5'] = newdf['TestUnder5']
df['TestPCunder5'] = newdf['TestTaken']/newdf['TestUnder5']

Output:

   Name  TestType Test  TestUnder5  TestPCunder5
0  John         3  Yes         1.0      1.000000
1  John         4   No         2.0      0.500000
2  Alex         2   No         1.0      0.000000
3  Alex         6   No         1.0      0.000000
4  John         7   No         2.0      0.500000
5  John         2  Yes         3.0      0.666667
6  Alex         1  Yes         2.0      0.500000
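
For reference, the two steps can be combined into one self-contained script run against the question's data. This is a sketch rather than a definitive implementation: the function name `add_running_under5` and its `threshold` parameter are assumptions introduced here, and the masks are cast to integers so the count column comes out as integers rather than floats.

```python
import pandas as pd

def add_running_under5(df, threshold=5):
    """Hypothetical helper: add per-person running count and 'Yes' ratio."""
    out = df.copy()
    under = out['TestType'] < threshold            # tests below the threshold
    yes_under = under & (out['Test'] == 'Yes')     # ...that also have Test == 'Yes'

    # Running totals within each person's rows, in the table's existing order
    out['TestsUnder5'] = under.astype(int).groupby(out['Name']).cumsum()
    yes_count = yes_under.astype(int).groupby(out['Name']).cumsum()

    # Ratio stays numeric; it is NaN while a person has no tests under the threshold
    out['TestPCunder5'] = yes_count / out['TestsUnder5']
    return out

df = pd.DataFrame(
    [
        ['John', 3, 'Yes'],
        ['John', 4, 'No'],
        ['Alex', 2, 'No'],
        ['Alex', 6, 'No'],
        ['John', 7, 'No'],
        ['John', 2, 'Yes'],
        ['Alex', 1, 'Yes'],
    ],
    columns=['Name', 'TestType', 'Test'],
)

result = add_running_under5(df)
print(result)
```

Returning a copy leaves the original DataFrame untouched, which makes it easy to compare the result against the expected table in the question.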
