Conditional Weighted Average calculation in pandas
I have 2 DataFrames, shown below.
Teacher_Commission_df:
+---------+---------+----------+---------+
| Subject | Harare | Redcliff | Norton |
+---------+---------+----------+---------+
| Science | 0.100 | 0.125 | 0.145 |
+---------+---------+----------+---------+
| English | 0.125 | 0.150 | 0.170 |
+---------+---------+----------+---------+
| Maths | 0.090 | 0.115 | 0.135 |
+---------+---------+----------+---------+
| Music | 0.100 | 0.125 | 0.145 |
+---------+---------+----------+---------+
| Total | 0.415 | 0.515 | 0.595 |
+---------+---------+----------+---------+
Students_df (note: there are no students for Maths in Harare and Norton):
+---------+--------+----------+--------+
| Subject | Harare | Redcliff | Norton |
+---------+--------+----------+--------+
| Science | 15 | 18 | 20 |
+---------+--------+----------+--------+
| English | 35 | 33 | 31 |
+---------+--------+----------+--------+
| Maths | | 25 | |
+---------+--------+----------+--------+
| Music | 40 | 42 | 45 |
+---------+--------+----------+--------+
I need to calculate the weighted average commission for each city, subject to a condition.
First I'll give the desired output, then explain the methodology.
The desired output is as below.
+------------+--------+----------+--------+
| Total_Paid | Harare | Redcliff | Norton |
+------------+--------+----------+--------+
| Science | 4.62 | 4.37 | 6.30 |
+------------+--------+----------+--------+
| English | 13.46 | 9.61 | 11.46 |
+------------+--------+----------+--------+
| Maths | 0.00 | 5.58 | 0.00 |
+------------+--------+----------+--------+
| Music | 12.31 | 10.19 | 14.18 |
+------------+--------+----------+--------+
Calculation methodology
In any city column [Harare, Redcliff, Norton], if the number of students for any subject [Science, English, Maths, Music] is zero, then that subject's Teacher_Commission should be removed from the weight.
For example, in Students_df, take the Science subject in the Harare city column. Since Maths has zero students in Harare, the Teacher_Commission is calculated as follows:
15 * [0.10 / (0.415 - 0.09)] = 4.62
Note that 0.09 is removed from the total in the denominator. In Redcliff, by contrast, it is calculated without any removal:
18 * [0.125 / 0.515] = 4.37
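The arithmetic above can be checked for the whole Harare column with a few lines of plain Python. This is only a sketch of the rule as described: the numbers are copied from the two tables, the empty Maths cell is treated as 0 students, and the variable names are illustrative.

```python
# Commission rates and student counts for Harare, copied from the tables above.
rates = {"Science": 0.100, "English": 0.125, "Maths": 0.090, "Music": 0.100}
students = {"Science": 15, "English": 35, "Maths": 0, "Music": 40}  # empty cell assumed to mean 0

# Only subjects that actually have students contribute to the denominator.
active_total = sum(rate for subject, rate in rates.items() if students[subject] > 0)

# students * (rate / sum of active rates), rounded like the desired output table.
paid = {
    subject: round(students[subject] * rates[subject] / active_total, 2)
    for subject in rates
}
print(paid)
```

The rounded values should line up with the Harare column of the desired output table.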
I hope my explanation is clear.
This can easily be done in Microsoft Excel by using an IF condition, but I'm looking for a scalable pandas solution.
I'm not sure how to start the calculation process, so please give me a kick start to solve this.
-----------------------------------------------------------------------------------------
UPDATE
I've managed to solve this. Refer to my answer below and suggest any improvements.
------------------------------------------------------------------------------------------
So, what you need is the row/column index of every empty (null) value in the DataFrame?
You can use numpy.where(). Depending on the data type of your null object, you could replace NaN with Null or "" depending on your dtype.
This is similar to what you'd do in Excel using an IF.
Personally, I would just make a binary copy of the DataFrame, i.e. put a 1 wherever there is a non-null value and a 0 at each null location, then just multiply the two. But that's probably more processing overhead.
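The binary-copy idea might look roughly like the sketch below. It rebuilds the two tables inline (without the Total row) so that it runs on its own; the names mask and masked_commission are illustrative, and the empty Maths cells are assumed to be NaN.

```python
import numpy as np
import pandas as pd

cities = ["Harare", "Redcliff", "Norton"]
subjects = pd.Index(["Science", "English", "Maths", "Music"], name="Subject")

teacher_commission_df = pd.DataFrame(
    [[0.100, 0.125, 0.145],
     [0.125, 0.150, 0.170],
     [0.090, 0.115, 0.135],
     [0.100, 0.125, 0.145]],
    index=subjects, columns=cities)

students_df = pd.DataFrame(
    [[15, 18, 20],
     [35, 33, 31],
     [np.nan, 25, np.nan],  # empty cells assumed to be NaN
     [40, 42, 45]],
    index=subjects, columns=cities)

# 1 where a subject has students in that city, 0 where the cell is empty.
mask = students_df.notna().astype(int)

# Multiplying by the mask zeroes out the commission of subjects without students.
masked_commission = teacher_commission_df * mask

# Each city's denominator now only counts the subjects that are still active.
weights = masked_commission / masked_commission.sum(axis=0)

total_paid = (weights * students_df.fillna(0)).round(2)
print(total_paid)
```

Whether the extra mask DataFrame costs noticeably more than boolean indexing depends on the size of the data; for a table this small it makes no practical difference.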
This is actually just two lines of code using pandas:
import numpy as np
df_tmp = teacher_commission_df[~students_df.isnull()]
df = (df_tmp.div(df_tmp.apply(np.nansum, axis=0)) * students_df).fillna(0)
Outcome (with the new 3-digit precision data):
In [1]: df
Out[1]:
Harare Redcliff Norton
Subject
Science 4.615385 4.368932 6.304348
English 13.461538 9.611650 11.456522
Maths 0.000000 5.582524 0.000000
Music 12.307692 10.194175 14.184783
Note: This explanation uses the 2-digit precision data given in the original question.
First, locate the empty cells in students_df with isnull():
In [1]: students_df.isnull()
Out[1]:
Harare Redcliff Norton
Subject
Science False False False
English False False False
Maths True False True
Music False False False
Then, you can get the non-null values from teacher_commission_df using boolean indexing and the not operator (~):
In [3]: teacher_commission_df[~students_df.isnull()]
Out[3]:
Harare Redcliff Norton
Subject
Science 0.10 0.13 0.15
English 0.13 0.15 0.17
Maths NaN 0.12 NaN
Music 0.10 0.13 0.15
Let's save this temporary DataFrame to a new variable, df_tmp:
In [12]: df_tmp = teacher_commission_df[~students_df.isnull()]
With the help of apply() and np.nansum, compute the sum of each column, ignoring NaN values:
In [14]: df_tmp.apply(np.nansum, axis=0)
Out[14]:
Harare 0.33
Redcliff 0.53
Norton 0.47
dtype: float64
Then, combine the sum with a division using DataFrame.div():
In [15]: df_tmp.div(df_tmp.apply(np.nansum, axis=0))
Out[15]:
Harare Redcliff Norton
Subject
Science 0.303030 0.245283 0.319149
English 0.393939 0.283019 0.361702
Maths NaN 0.226415 NaN
Music 0.303030 0.245283 0.319149
In [16]: df_tmp.div(df_tmp.apply(np.nansum, axis=0)) * students_df
Out[16]:
Harare Redcliff Norton
Subject
Science 4.545455 4.415094 6.382979
English 13.787879 9.339623 11.212766
Maths NaN 5.660377 NaN
Music 12.121212 10.301887 14.361702
Finally, fill the NaN values with zeroes using DataFrame.fillna():
In [17]: (df_tmp.div(df_tmp.apply(np.nansum, axis=0)) * students_df).fillna(0)
Out[17]:
Harare Redcliff Norton
Subject
Science 4.545455 4.415094 6.382979
English 13.787879 9.339623 11.212766
Maths 0.000000 5.660377 0.000000
Music 12.121212 10.301887 14.361702
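The walk-through above can also be condensed into a small helper. This is a sketch rather than the answer's exact code: weighted_commission is a hypothetical name, DataFrame.where stands in for the boolean indexing, and DataFrame.sum (which skips NaN by default) stands in for apply(np.nansum). The data is the 3-digit precision version from the question, without the Total row.

```python
import numpy as np
import pandas as pd

def weighted_commission(commission, students):
    """Spread each student count over the commission rates of active subjects."""
    # Keep a rate only where the matching student cell is filled; others become NaN.
    active = commission.where(students.notna())
    # DataFrame.sum skips NaN by default, so empty subjects drop out of the total.
    weights = active.div(active.sum(axis=0), axis=1)
    # NaN cells (subjects without students) end up as 0 in the result.
    return weights.mul(students).fillna(0)

subjects = pd.Index(["Science", "English", "Maths", "Music"], name="Subject")
teacher_commission_df = pd.DataFrame(
    {"Harare": [0.100, 0.125, 0.090, 0.100],
     "Redcliff": [0.125, 0.150, 0.115, 0.125],
     "Norton": [0.145, 0.170, 0.135, 0.145]},
    index=subjects)
students_df = pd.DataFrame(
    {"Harare": [15, 35, np.nan, 40],
     "Redcliff": [18, 33, 25, 42],
     "Norton": [20, 31, np.nan, 45]},
    index=subjects)

result = weighted_commission(teacher_commission_df, students_df)
print(result.round(2))
```

Because the helper only relies on index and column alignment, it should carry over to more cities or subjects as long as both DataFrames share the same labels.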
Based on the suggestion given by user aak, I've managed to solve this purely with numpy.
import numpy as np
import pandas as pd

# Load data (skipfooter=1 drops the Total row) and fill N/A values
Teacher_Commission_df = pd.read_excel('data_Teacher.xlsx', index_col='Subject', skipfooter=1)
Students_df = pd.read_excel('data_Studenst.xlsx', index_col='Subject')
Students_df.fillna(value=0, inplace=True)
# Convert DataFrames to NumPy arrays
T = Teacher_Commission_df.to_numpy(dtype='float')
S = Students_df.to_numpy(dtype='float')
# Find the positions of zero values in the students array and
# set the corresponding values in the teachers array to zero
T[np.where(S == 0)] = 0
# Create a temporary array of column sums for the calculation
Total_Teacher = T.sum(axis=0)
# Calculate incentives
Calculations = T * (S / Total_Teacher)
incentives = (pd.DataFrame(Calculations, columns=Students_df.columns, index=Students_df.index)
              .round(decimals=2)
              .reset_index())
incentives