
Groupby: fill missing values in a dataframe with the average of the previous available value and the next available value

I have a data frame with several groups, and I want to fill the missing values in the score column with the average of the last previous available value and the next available value, i.e. (previous value + next value) / 2.

I want to group by state, school, class, and subject, and then fill the values.

If the first value in the score column is not available, fill it with the next available value; if the last value is not available, fill it with the previous available value. This rule needs to be followed for each group.

This is a complex data-imputation problem. I searched online and found that pandas has some functionality for this, i.e. pandas.core.groupby.DataFrameGroupBy.ffill, but I don't know how to use it in this case.

I am open to solving this in Python, PySpark, or SQL!

My data frame looks like this:

[image: missing values]

[image: data imputation]
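Since the question mentions pandas' groupby `ffill`, here is a minimal pandas sketch (the sample values below are made up for illustration): forward-fill and backward-fill the score within each group, average the two sides, and fall back to whichever side exists at the start or end of a group.

```python
import pandas as pd

# Hypothetical sample with one group and gaps in the score column
df = pd.DataFrame({
    "state":   ["X"] * 6,
    "school":  ["S1"] * 6,
    "class":   ["A"] * 6,
    "subject": ["math"] * 6,
    "score":   [None, 46, None, None, 35, None],
})

keys = ["state", "school", "class", "subject"]
g = df.groupby(keys)["score"]

prev = g.ffill()   # last previous available value (NaN before the first value)
nxt = g.bfill()    # next available value (NaN after the last value)

# Average of both sides; where one side is missing, fall back to the other
df["new_score"] = prev.add(nxt).div(2).fillna(prev).fillna(nxt)
print(df)
```

Unlike a plain `ffill`, this takes the average of the two neighbours, and the two `fillna` calls implement the edge rule: a leading gap gets the next value, a trailing gap gets the previous value.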

Perhaps this is helpful -

Load the test data:

    df2.show(false)
    df2.printSchema()
    /**
      * +-----+-----+
      * |class|score|
      * +-----+-----+
      * |A    |null |
      * |A    |46   |
      * |A    |null |
      * |A    |null |
      * |A    |35   |
      * |A    |null |
      * |A    |null |
      * |A    |null |
      * |A    |46   |
      * |A    |null |
      * |A    |null |
      * |B    |78   |
      * |B    |null |
      * |B    |null |
      * |B    |null |
      * |B    |null |
      * |B    |null |
      * |B    |56   |
      * |B    |null |
      * +-----+-----+
      *
      * root
      * |-- class: string (nullable = true)
      * |-- score: integer (nullable = true)
      */

Impute null values in the score column (check the new_score column):


    val w1 = Window.partitionBy("class").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    val w2 = Window.partitionBy("class").rowsBetween(Window.currentRow, Window.unboundedFollowing)
    df2.withColumn("previous", last("score", ignoreNulls = true).over(w1))
      .withColumn("next", first("score", ignoreNulls = true).over(w2))
      .withColumn("new_score", (coalesce($"previous", $"next") + coalesce($"next", $"previous")) / 2)
      .drop("next", "previous")
      .show(false)

    /**
      * +-----+-----+---------+
      * |class|score|new_score|
      * +-----+-----+---------+
      * |A    |null |46.0     |
      * |A    |46   |46.0     |
      * |A    |null |40.5     |
      * |A    |null |40.5     |
      * |A    |35   |35.0     |
      * |A    |null |40.5     |
      * |A    |null |40.5     |
      * |A    |null |40.5     |
      * |A    |46   |46.0     |
      * |A    |null |46.0     |
      * |A    |null |46.0     |
      * |B    |78   |78.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |56   |56.0     |
      * |B    |null |56.0     |
      * +-----+-----+---------+
      */
