
Groupby: fill missing values in a dataframe with the average of the previous available value and the next available value

I have a data frame with several groups, and I want to fill the missing values in the score column with the average of the last previous available value and the next available value, i.e. (previous value + next value) / 2.

I want to group by state, school, class, and subject, and then fill the values.

If the first value in the score column is not available, fill it with the next available value; if the last value is not available, fill it with the previous available value. This rule needs to be followed for each group.

This is a complex data-imputation problem. I searched online and found that pandas has some functionality for this, i.e. pandas.core.groupby.DataFrameGroupBy.ffill, but I don't know how to use it in this case.

I am open to solving this in Python, PySpark, or SQL!

My data frame looks like this:

[image: missing values]

[image: data imputation]
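Since the question mentions pandas' groupby `ffill`, here is a minimal pandas sketch (the sample values below are made up for illustration): forward-fill and backward-fill the score within each group, average the two sides, and fall back to whichever side exists at the start or end of a group.

```python
import pandas as pd

# Hypothetical sample with one group and gaps in the score column
df = pd.DataFrame({
    "state":   ["X"] * 6,
    "school":  ["S1"] * 6,
    "class":   ["A"] * 6,
    "subject": ["math"] * 6,
    "score":   [None, 46, None, None, 35, None],
})

keys = ["state", "school", "class", "subject"]
g = df.groupby(keys)["score"]

prev = g.ffill()   # last previous available value (NaN before the first value)
nxt = g.bfill()    # next available value (NaN after the last value)

# Average of both sides; where one side is missing, fall back to the other
df["new_score"] = prev.add(nxt).div(2).fillna(prev).fillna(nxt)
print(df)
```

Unlike a plain `ffill`, this takes the average of the two neighbours, and the two `fillna` calls implement the edge rule: a leading gap gets the next value, a trailing gap gets the previous value.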

Perhaps this is helpful -

Load the test data:

    df2.show(false)
    df2.printSchema()
    /**
      * +-----+-----+
      * |class|score|
      * +-----+-----+
      * |A    |null |
      * |A    |46   |
      * |A    |null |
      * |A    |null |
      * |A    |35   |
      * |A    |null |
      * |A    |null |
      * |A    |null |
      * |A    |46   |
      * |A    |null |
      * |A    |null |
      * |B    |78   |
      * |B    |null |
      * |B    |null |
      * |B    |null |
      * |B    |null |
      * |B    |null |
      * |B    |56   |
      * |B    |null |
      * +-----+-----+
      *
      * root
      * |-- class: string (nullable = true)
      * |-- score: integer (nullable = true)
      */

Impute null values in the score column (check the new_score column):


    val w1 = Window.partitionBy("class").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    val w2 = Window.partitionBy("class").rowsBetween(Window.currentRow, Window.unboundedFollowing)
    df2.withColumn("previous", last("score", ignoreNulls = true).over(w1))
      .withColumn("next", first("score", ignoreNulls = true).over(w2))
      .withColumn("new_score", (coalesce($"previous", $"next") + coalesce($"next", $"previous")) / 2)
      .drop("next", "previous")
      .show(false)

    /**
      * +-----+-----+---------+
      * |class|score|new_score|
      * +-----+-----+---------+
      * |A    |null |46.0     |
      * |A    |46   |46.0     |
      * |A    |null |40.5     |
      * |A    |null |40.5     |
      * |A    |35   |35.0     |
      * |A    |null |40.5     |
      * |A    |null |40.5     |
      * |A    |null |40.5     |
      * |A    |46   |46.0     |
      * |A    |null |46.0     |
      * |A    |null |46.0     |
      * |B    |78   |78.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |56   |56.0     |
      * |B    |null |56.0     |
      * +-----+-----+---------+
      */
