Conditional weighted average calculation in pandas

Question

I have two DataFrames, shown below.

Teacher_Commission_df:

+---------+---------+----------+---------+
| Subject |  Harare | Redcliff |  Norton |
+---------+---------+----------+---------+
| Science |  0.100  |   0.125  |  0.145  |
+---------+---------+----------+---------+
| English |  0.125  |   0.150  |  0.170  |
+---------+---------+----------+---------+
|  Maths  |  0.090  |   0.115  |  0.135  |
+---------+---------+----------+---------+
|  Music  |  0.100  |   0.125  |  0.145  |
+---------+---------+----------+---------+
|  Total  |  0.415  |   0.515  |  0.595  |
+---------+---------+----------+---------+

Students_df (note that there are no students for Maths in Harare and Norton):

+---------+--------+----------+--------+
| Subject | Harare | Redcliff | Norton |
+---------+--------+----------+--------+
| Science |   15   |    18    |   20   |
+---------+--------+----------+--------+
| English |   35   |    33    |   31   |
+---------+--------+----------+--------+
|  Maths  |        |    25    |        |
+---------+--------+----------+--------+
|  Music  |   40   |    42    |   45   |
+---------+--------+----------+--------+
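
For anyone who wants to reproduce the problem, the two tables can be built roughly as follows. This is only a sketch: the lowercase variable names follow the ones used in Answer 2, the Total row is deliberately left out, and the empty cells are assumed to be NaN.

```python
import numpy as np
import pandas as pd

subjects = pd.Index(["Science", "English", "Maths", "Music"], name="Subject")
cities = ["Harare", "Redcliff", "Norton"]

# Commission rates per subject and city. The "Total" row is omitted
# because it can be recomputed with .sum().
teacher_commission_df = pd.DataFrame(
    [[0.100, 0.125, 0.145],
     [0.125, 0.150, 0.170],
     [0.090, 0.115, 0.135],
     [0.100, 0.125, 0.145]],
    index=subjects,
    columns=cities,
)

# Student counts; np.nan marks the empty cells (assumed to mean "no students").
students_df = pd.DataFrame(
    [[15, 18, 20],
     [35, 33, 31],
     [np.nan, 25, np.nan],
     [40, 42, 45]],
    index=subjects,
    columns=cities,
)

# Column sums of the rates, to compare against the "Total" row of the table.
print(teacher_commission_df.sum())
```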

I need to calculate the weighted average commission for each city, subject to a condition.

I'll first give the desired output and then explain the methodology.

The desired output is as follows:

+------------+--------+----------+--------+
| Total_Paid | Harare | Redcliff | Norton |
+------------+--------+----------+--------+
|   Science  |  4.62  |   4.37   |  6.30  |
+------------+--------+----------+--------+
|   English  |  13.46 |   9.61   |  11.46 |
+------------+--------+----------+--------+
|    Maths   |  0.00  |   5.58   |  0.00  |
+------------+--------+----------+--------+
|    Music   |  12.31 |   10.19  |  14.18 |
+------------+--------+----------+--------+

Calculation methodology

If, in any city column [Harare, Redcliff, Norton], the number of students for a Subject [Science, English, Maths, Music] is zero, then that subject's Teacher_Commission should be removed from the weight total for that city.

For example, take the Science subject in the Harare column of Students_df. Since Maths has zero students in Harare, the Teacher_Commission is calculated as 15 * [0.10 / (0.415 - 0.09)] = 4.62. Note the removal of 0.09 from the total in the denominator. In Redcliff, by contrast, nothing is removed, and the value is calculated as 18 * [0.125 / 0.515] = 4.37.
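
The Harare arithmetic can be checked with plain Python. This is only an illustration of the rule described above; the dictionary names are made up for the example, and the missing Maths count is written as 0.

```python
# Commission rates and student counts for Harare, copied from the tables above.
harare_rates = {"Science": 0.100, "English": 0.125, "Maths": 0.090, "Music": 0.100}
harare_students = {"Science": 15, "English": 35, "Maths": 0, "Music": 40}

# Denominator: sum only the rates of subjects that actually have students,
# which is the same as 0.415 - 0.09 in the worked example.
denominator = sum(
    rate for subject, rate in harare_rates.items() if harare_students[subject] > 0
)

science_paid = harare_students["Science"] * harare_rates["Science"] / denominator
print(round(science_paid, 2))  # 4.62
```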

I hope my explanation is clear.

This can easily be done in Microsoft Excel with an IF condition, but I'm looking for a scalable pandas solution.

I'm not sure how to start the calculation. Please give me a kick start to solve this.

-----------------------------------------------------------------------------------------
 UPDATE
  I've managed to solve this. See my answer below; suggestions for improvement are welcome.
------------------------------------------------------------------------------------------

Answer 1

So what you need is the row/column index of every empty or null value in the dataframe?

You can use numpy.where(). Depending on the data type of your null object, you could:

Load df as a NumPy array: arr = df.to_numpy(dtype=float)
i, j = np.where(np.isnan(arr))
i and j are now the indexes you can use to eliminate the weights if the sizes are the same, or use dataframe.index to find which weight to remove.

Replace NaN with None or "" depending on your dtype

This is similar to what you'd do in Excel using an IF.
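
A small sketch of the numpy.where() idea, assuming the empty cells in Students_df are NaN. The variable names `rows`, `cols` and `missing` are chosen for this example only.

```python
import numpy as np
import pandas as pd

students_df = pd.DataFrame(
    {"Harare": [15, 35, np.nan, 40],
     "Redcliff": [18, 33, 25, 42],
     "Norton": [20, 31, np.nan, 45]},
    index=pd.Index(["Science", "English", "Maths", "Music"], name="Subject"),
)

# Row and column positions of every missing student count.
rows, cols = np.where(students_df.isna().to_numpy())

# Translate the integer positions back into (subject, city) labels.
missing = list(zip(students_df.index[rows], students_df.columns[cols]))
print(missing)  # [('Maths', 'Harare'), ('Maths', 'Norton')]
```

These positions are the cells whose commission rate has to be dropped from the per-city total.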

Personally, I would just make a binary copy of the dataframe, i.e. put a 1 wherever there is a non-null value and a 0 at each null location, then multiply the two element-wise. But that is probably more processing overhead.
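
The binary-mask idea might look like this on a two-city subset of the data. It is a sketch rather than a tuned implementation, and `rates`, `students`, `mask` and `paid` are names invented for the example.

```python
import numpy as np
import pandas as pd

idx = pd.Index(["Science", "English", "Maths", "Music"], name="Subject")
rates = pd.DataFrame({"Harare": [0.100, 0.125, 0.090, 0.100],
                      "Redcliff": [0.125, 0.150, 0.115, 0.125]}, index=idx)
students = pd.DataFrame({"Harare": [15, 35, np.nan, 40],
                         "Redcliff": [18, 33, 25, 42]}, index=idx)

# 1 where a student count exists, 0 where it is missing.
mask = students.notna().astype(int)

# Zero out the rates of subjects without students, then renormalise per city.
masked_rates = rates * mask
paid = (masked_rates / masked_rates.sum()) * students.fillna(0)
print(paid.round(2))
```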

Answer 2

Solution using pandas

This is actually just two lines of code using pandas:

import numpy as np
df_tmp = teacher_commission_df[~students_df.isnull()]
df = (df_tmp.div(df_tmp.apply(np.nansum, axis=0)) * students_df).fillna(0)

Outcome (with the new 3-digit precision data):

In [1]: df
Out[1]:
            Harare   Redcliff     Norton
Subject
Science   4.615385   4.368932   6.304348
English  13.461538   9.611650  11.456522
Maths     0.000000   5.582524   0.000000
Music    12.307692  10.194175  14.184783

Explanation of the code above

Note: This explanation uses the 2-digit precision data given in the original question.

First, build a boolean mask of the missing values with DataFrame.isnull():

In [1]: students_df.isnull()
Out[1]:
         Harare  Redcliff  Norton
Subject
Science   False     False   False
English   False     False   False
Maths      True     False    True
Music     False     False   False

Then, you can select the values of teacher_commission_df where students_df is not null, using boolean indexing and the not operator (~).

In [3]: teacher_commission_df[~students_df.isnull()]
Out[3]:
         Harare  Redcliff  Norton
Subject
Science    0.10      0.13    0.15
English    0.13      0.15    0.17
Maths       NaN      0.12     NaN
Music      0.10      0.13    0.15

Let's save this temporary dataframe into a new variable, df_tmp:

In [12]: df_tmp = teacher_commission_df[~students_df.isnull()]

Now, we want to divide the value in each cell by the sum of its column's values. The column sums are calculated, ignoring NaNs, with the help of apply() and np.nansum:

In [14]: df_tmp.apply(np.nansum, axis=0)
Out[14]:
Harare      0.33
Redcliff    0.53
Norton      0.47
dtype: float64
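
As a side note, DataFrame.sum() skips NaN by default (skipna=True), so it should give the same column sums without apply(). A quick check on a cut-down copy of df_tmp, with the values typed in by hand from the output above:

```python
import numpy as np
import pandas as pd

# Two columns of df_tmp, copied from the 2-digit output shown earlier.
df_tmp = pd.DataFrame({"Harare": [0.10, 0.13, np.nan, 0.10],
                       "Redcliff": [0.13, 0.15, 0.12, 0.13]})

via_apply = df_tmp.apply(np.nansum, axis=0)
via_sum = df_tmp.sum(axis=0)  # skipna=True is the default

print(np.allclose(via_apply, via_sum))  # True
```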

Then, combine the summing with division, using DataFrame.div():

In [15]: df_tmp.div(df_tmp.apply(np.nansum, axis=0))
Out[15]:
           Harare  Redcliff    Norton
Subject
Science  0.303030  0.245283  0.319149
English  0.393939  0.283019  0.361702
Maths         NaN  0.226415       NaN
Music    0.303030  0.245283  0.319149

Then, multiply the dataframes (element-wise multiplication):

In [16]: df_tmp.div(df_tmp.apply(np.nansum, axis=0)) * students_df
Out[16]:
            Harare   Redcliff     Norton
Subject
Science   4.545455   4.415094   6.382979
English  13.787879   9.339623  11.212766
Maths          NaN   5.660377        NaN
Music    12.121212  10.301887  14.361702

Lastly, fill the NaN values with zeros using DataFrame.fillna():

In [17]: (df_tmp.div(df_tmp.apply(np.nansum, axis=0)) * students_df).fillna(0)
Out[17]:
            Harare   Redcliff     Norton
Subject
Science   4.545455   4.415094   6.382979
English  13.787879   9.339623  11.212766
Maths     0.000000   5.660377   0.000000
Music    12.121212  10.301887  14.361702
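
The steps above can also be wrapped into one function and checked against the desired output from the question. The sketch below uses the 3-digit data; `total_paid` is a hypothetical helper name, and DataFrame.where() with sum() is used here as an assumed equivalent of the boolean indexing and np.nansum shown above.

```python
import numpy as np
import pandas as pd

def total_paid(rates: pd.DataFrame, students: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: renormalise rates over subjects that have students."""
    active = rates.where(students.notna())    # NaN where there are no students
    weights = active.div(active.sum(axis=0))  # sum() skips NaN by default
    return (weights * students).fillna(0)

idx = pd.Index(["Science", "English", "Maths", "Music"], name="Subject")
rates = pd.DataFrame({"Harare": [0.100, 0.125, 0.090, 0.100],
                      "Redcliff": [0.125, 0.150, 0.115, 0.125],
                      "Norton": [0.145, 0.170, 0.135, 0.145]}, index=idx)
students = pd.DataFrame({"Harare": [15, 35, np.nan, 40],
                         "Redcliff": [18, 33, 25, 42],
                         "Norton": [20, 31, np.nan, 45]}, index=idx)

result = total_paid(rates, students)
print(result.round(2))
```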

Answer 3

Based on the suggestion given by user aak, I've managed to solve this purely with numpy.

import numpy as np
import pandas as pd

# Load data; skipfooter=1 drops the "Total" row. Missing student counts become 0.
Teacher_Commission_df = pd.read_excel('data_Teacher.xlsx', index_col='Subject', skipfooter=1)
Students_df = pd.read_excel('data_Studenst.xlsx', index_col='Subject')
Students_df = Students_df.fillna(value=0)

# Convert DataFrames to NumPy arrays (copy T so it can be modified safely)
T = Teacher_Commission_df.to_numpy(dtype='float', copy=True)
S = Students_df.to_numpy(dtype='float')

# Zero out the teacher rates wherever the student count is zero
T[S == 0] = 0

# Per-city sum of the remaining rates, used as the denominator
Total_Teacher = T.sum(axis=0)

# Calculate incentives
Calculations = T * (S / Total_Teacher)

incentives = (pd.DataFrame(Calculations, columns=Students_df.columns, index=Students_df.index)
                  .round(decimals=2)
                  .reset_index())
incentives
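
One possible improvement, offered as a sketch: if a city had no students at all, Total_Teacher would be 0 for that column and the division above would produce NaN. The version below guards against that with the `out` and `where` arguments of np.divide. The third column is a made-up city with no students, added only to exercise that case, and the arrays are typed in by hand instead of being read from Excel.

```python
import numpy as np

# Rates (T) and student counts (S); rows are subjects, columns are cities.
# The last column is a hypothetical city with no students at all.
T = np.array([[0.100, 0.125, 0.2],
              [0.125, 0.150, 0.3],
              [0.090, 0.115, 0.5]])
S = np.array([[15.0, 18.0, 0.0],
              [35.0, 33.0, 0.0],
              [0.0, 25.0, 0.0]])

# Drop the rates wherever there are no students.
T_active = np.where(S == 0, 0.0, T)
totals = T_active.sum(axis=0)

# Divide only where the column total is non-zero; leave 0 elsewhere.
weights = np.divide(T_active, totals,
                    out=np.zeros_like(T_active), where=totals != 0)
paid = weights * S
print(paid.round(2))
```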

Answer 1: 1 vote
Answer 2: 1 vote, accepted, 2020-07-25 15:19:51
Answer 3: 0 votes, 2020-07-25 14:47:45
