pandas dataframe filter calculation
I have the following dataframe:
   student_id  gender      major  admitted
0       35377  female  Chemistry     False
1       56105    male    Physics      True
2       31441  female  Chemistry     False
3       51765    male    Physics      True
4       53714  female    Physics      True
5       50693  female  Chemistry     False
6       25946    male    Physics      True
7       27648  female  Chemistry      True
8       55247    male    Physics     False
9       35838    male    Physics      True
How would I calculate the admission rate for female physics majors?
import numpy as np
np.average(dat['admitted'][(dat['gender']=='female') & (dat['major']=='Physics')].values)
How it works: (dat['gender']=='female') & (dat['major']=='Physics') creates a boolean pandas Series, which is used to select the matching entries from the dat['admitted'] Series. The .values attribute extracts those entries into a NumPy array. Finally, we take the average of those entries, which gives the admission rate.
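To try the masking approach end to end, here is a minimal sketch that rebuilds the table from the question by hand. The variable names mask and rate are my own, and I use .loc with .mean() instead of np.average with .values, which should give the same number.

```python
import pandas as pd

# The ten rows from the question, typed in by hand.
dat = pd.DataFrame({
    "student_id": [35377, 56105, 31441, 51765, 53714,
                   50693, 25946, 27648, 55247, 35838],
    "gender": ["female", "male", "female", "male", "female",
               "female", "male", "female", "male", "male"],
    "major": ["Chemistry", "Physics", "Chemistry", "Physics", "Physics",
              "Chemistry", "Physics", "Chemistry", "Physics", "Physics"],
    "admitted": [False, True, False, True, True,
                 False, True, True, False, True],
})

# Boolean Series: True only for rows that are both female and Physics.
mask = (dat["gender"] == "female") & (dat["major"] == "Physics")

# Select the matching 'admitted' entries and average them.
rate = dat.loc[mask, "admitted"].mean()

print(mask.sum())  # number of female physics applicants
print(rate)        # admission rate for that group
```

With this sample there is only one female physics applicant (student 53714), and she was admitted, so the rate comes out as 1.0.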
I think:
df_f = df[(df['gender']=='female') & (df['major']=='Physics')]
df_f['admitted'].mean()
The first part filters for female and Physics. Next, we calculate the mean.
Taking the mean may sound unintuitive, but mathematically it gives the admission rate as a fraction between 0 and 1 (multiply by 100 for a percentage). Python treats boolean values as 0 and 1, so summing them and dividing by the count (which is what mean does) gives the proportion of female students with a major in Physics who were admitted.
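A small sketch of why the mean of a boolean column is a proportion. The Series below is just the admitted values of the five male physics rows from the question, typed in by hand; the names by_hand and by_mean are illustrative.

```python
import pandas as pd

# 'admitted' for the five male Physics applicants in the question.
admitted = pd.Series([True, True, True, False, True])

# True counts as 1 and False as 0, so the sum is the number admitted.
n_admitted = admitted.sum()

# Dividing by the number of entries gives the proportion admitted.
by_hand = n_admitted / len(admitted)

# .mean() performs the same sum-and-divide in one call.
by_mean = admitted.mean()

print(n_admitted, by_hand, by_mean)
```

Both by_hand and by_mean come out as 0.8 here, i.e. 4 of the 5 applicants were admitted.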
import numpy as np
import pandas as pd
df = pd.DataFrame({"gender":np.random.choice(["male","female"],[20]),
"admitted":np.random.choice([True,False],[20]),
"major":np.random.choice(["Chemistry","Physics"],[20])})
phy_female_admited = df.loc[(df["major"]=="Physics") & (df["admitted"]==True) & ((df["gender"]=="female"))]
phy_female_applied = df.loc[(df["major"]=="Physics") & ((df["gender"]=="female"))]
acceptance_rate = phy_female_admited.shape[0]/phy_female_applied.shape[0]
A slightly more expanded answer, but it basically works the same way as DZurico's.
Ignore the lines where I create a DataFrame and use your own data instead.
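One thing to watch with the count-based version: if no rows match the filter (quite possible with random data), the division raises ZeroDivisionError. A rough sketch of a guarded variant follows; the function name acceptance_rate and the choice to return NaN for an empty group are my own assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

def acceptance_rate(df, gender, major):
    """Admitted / applied for one gender and major, NaN if nobody applied."""
    applied = df[(df["gender"] == gender) & (df["major"] == major)]
    if applied.empty:
        # No applicants in this group, so the rate is undefined.
        return np.nan
    # 'admitted' is assumed to be a boolean column.
    admitted = applied[applied["admitted"]]
    return len(admitted) / len(applied)

# Tiny hand-made example instead of random data.
df = pd.DataFrame({
    "gender": ["female", "female", "male"],
    "major": ["Physics", "Physics", "Physics"],
    "admitted": [True, False, True],
})

print(acceptance_rate(df, "female", "Physics"))   # one of two admitted
print(acceptance_rate(df, "male", "Chemistry"))   # no applicants in this group
```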
Solution for all admission rates with groupby and GroupBy.size, and GroupBy.transform with sum:
a = df.groupby(['gender' ,'admitted', 'major']).size()
print (a)
gender admitted major
female False Chemistry 3
True Chemistry 1
Physics 1
male False Physics 1
True Physics 4
dtype: int64
b = a.groupby(['gender' ,'major']).transform('sum')
print (b)
gender admitted major
female False Chemistry 4
True Chemistry 4
Physics 1
male False Physics 5
True Physics 5
dtype: int64
c = a.div(b)
print (c)
gender admitted major
female False Chemistry 0.75
True Chemistry 0.25
Physics 1.00
male False Physics 0.20
True Physics 0.80
dtype: float64
Select the row of c you need by tuple:
print (c.loc[('female',True,'Physics')])
1.0
If you want all values in a DataFrame:
d = a.div(b).reset_index(name='rates')
print (d)
gender admitted major rates
0 female False Chemistry 0.75
1 female True Chemistry 0.25
2 female True Physics 1.00
3 male False Physics 0.20
4 male True Physics 0.80
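If only the admitted rate per group is needed (not the rejected share as well), a shorter alternative is to group by gender and major and take the mean of the boolean column directly. This is a different route from the size/transform approach above, sketched here on the question's data typed in by hand; the name rates is illustrative.

```python
import pandas as pd

# The ten rows from the question (student_id omitted, it is not needed here).
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female",
               "female", "male", "female", "male", "male"],
    "major": ["Chemistry", "Physics", "Chemistry", "Physics", "Physics",
              "Chemistry", "Physics", "Chemistry", "Physics", "Physics"],
    "admitted": [False, True, False, True, True,
                 False, True, True, False, True],
})

# Mean of a boolean column within each (gender, major) group
# is the share of admitted students in that group.
rates = df.groupby(["gender", "major"])["admitted"].mean()
print(rates)

# Pick one group by its (gender, major) tuple.
print(rates.loc[("female", "Physics")])
```

On this data the rates agree with the admitted=True rows of the table above: 0.25 for female Chemistry, 1.0 for female Physics and 0.8 for male Physics.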