How to use groupby with conditions and then cumcount in a Pandas DataFrame
I have the following dataset:
import pandas as pd

df = pd.DataFrame(
[
['John', 3, 'Yes'],
['John', 4, 'No'],
['Alex', 2, 'No'],
['Alex', 6, 'No'],
['John', 7, 'No'],
['John', 2, 'Yes'],
['Alex', 1, 'Yes']
], columns = ['Name', 'TestType', 'Test'])
Giving me:
print(df):
Name TestType Test
John 3 Yes
John 4 No
Alex 2 No
Alex 6 No
John 7 No
John 2 Yes
Alex 1 Yes
The table is in chronological order, so what I am trying to achieve is an up-to-date count, per person, of tests where TestType is less than 5, and the percentage of those tests that the person has taken (Test is 'Yes').
I am hoping for the output to be:
print (df):
Name TestType Test TestsUnder5 TestPCunder5
John 3 Yes 1 100%
John 4 No 2 50%
Alex 2 No 1 0%
Alex 6 No 1 0%
John 7 No 2 50%
John 2 Yes 3 67%
Alex 1 Yes 2 50%
I think I need to use groupby and cumsum, but I am not sure how to specify the condition and then perform the calculation. Any help would be much appreciated!
Almost there. You can apply math operators to a boolean Series, which coerces its values to the integers 0 or 1. For TestsUnder5, it looks like this might work:
df['TestsUnder5'] = (df.TestType < 5).groupby(df.Name).cumsum()
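As a minimal sketch of the running count, assuming the question's data with the Test values quoted as strings: the boolean mask is cast to integers explicitly with astype(int), which is an illustrative choice rather than something the answer requires, and the names under5 and running_count are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['John', 3, 'Yes'],
        ['John', 4, 'No'],
        ['Alex', 2, 'No'],
        ['Alex', 6, 'No'],
        ['John', 7, 'No'],
        ['John', 2, 'Yes'],
        ['Alex', 1, 'Yes'],
    ],
    columns=['Name', 'TestType', 'Test'],
)

# True where the test type is under 5, then 1/0 so it can be summed.
under5 = (df['TestType'] < 5).astype(int)

# Cumulative sum within each person, kept in the original row order.
running_count = under5.groupby(df['Name']).cumsum()

print(running_count.tolist())
```

Because the cumulative sum is computed per group but returned on the original index, the result can be assigned straight back as a new column.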
Similarly, for the percentage, you can combine the two conditions with a logical AND (&) to count the tests under 5 where Test is 'Yes':
df['TestPCunder5'] = (
(
((df.Test == 'Yes') & (df.TestType < 5))
.groupby(df.Name).cumsum()
) / df['TestsUnder5']
)
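The ratio can be checked end to end with a small self-contained sketch, again assuming the question's data with quoted Test values. The helper names under5, yes_under5, count_under5 and ratio are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['John', 3, 'Yes'],
        ['John', 4, 'No'],
        ['Alex', 2, 'No'],
        ['Alex', 6, 'No'],
        ['John', 7, 'No'],
        ['John', 2, 'Yes'],
        ['Alex', 1, 'Yes'],
    ],
    columns=['Name', 'TestType', 'Test'],
)

# Denominator: running count of under-5 tests per person.
under5 = (df['TestType'] < 5).astype(int)
count_under5 = under5.groupby(df['Name']).cumsum()

# Numerator: running count of under-5 tests where Test is 'Yes'.
yes_under5 = ((df['Test'] == 'Yes') & (df['TestType'] < 5)).astype(int)
count_yes_under5 = yes_under5.groupby(df['Name']).cumsum()

# Element-wise division gives the running fraction for each row.
ratio = count_yes_under5 / count_under5

print(ratio.round(2).tolist())
```

One caveat worth keeping in mind: if a person's first rows all have TestType of 5 or more, the denominator is 0 for those rows and the division yields NaN, so it may be worth deciding how such rows should be displayed.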
Your example results appear to be strings formatted as "{:.0%}". If that's what you're looking for, you can coerce this column to string:
df['TestPCunder5'] = df['TestPCunder5'].apply('{:.0%}'.format)
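To illustrate what that format string produces, here is a short sketch that applies it to the fractions from the question's expected output. The Series is typed in by hand for illustration, and Series.map is used here in place of apply as an equivalent alternative for element-wise formatting.

```python
import pandas as pd

# Running fractions for the seven rows, entered by hand for illustration.
fractions = pd.Series([1.0, 0.5, 0.0, 0.0, 0.5, 2 / 3, 0.5])

# '{:.0%}' multiplies by 100, rounds to zero decimals and appends '%'.
formatted = fractions.map('{:.0%}'.format)

print(formatted.tolist())
```

Note that the formatted column holds strings, so it is best created last, after any further arithmetic on the numeric values.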
This is my approach:
newdf = (df.assign(TestUnder5=df.TestType.lt(5),
TestTaken=df.TestType.lt(5) & df.Test.eq('Yes')
)
.groupby('Name')
[['TestUnder5','TestTaken']]
.cumsum()
)
# update original dataframe
df['TestUnder5'] = newdf['TestUnder5']
df['TestPCunder5'] = newdf['TestTaken']/newdf['TestUnder5']
Output:
Name TestType Test TestUnder5 TestPCunder5
0 John 3 Yes 1.0 1.000000
1 John 4 No 2.0 0.500000
2 Alex 2 No 1.0 0.000000
3 Alex 6 No 1.0 0.000000
4 John 7 No 2.0 0.500000
5 John 2 Yes 3.0 0.666667
6 Alex 1 Yes 2.0 0.500000