Fill null values in a PySpark DataFrame column with the mean of that column

Question

Given a DataFrame like this:

rdd_2 = sc.parallelize([(0,10,223,"201601"), (0,10,83,"2016032"),(1,20,None,"201602"),(1,20,3003,"201601"), (1,20,None,"201603"), (2,40, 2321,"201601"), (2,30, 10,"201602"),(2,61, None,"201601")])

df_data = sqlContext.createDataFrame(rdd_2, ["id", "type", "cost", "date"])
df_data.show()

+---+----+----+-------+
| id|type|cost|   date|
+---+----+----+-------+
|  0|  10| 223| 201601|
|  0|  10|  83|2016032|
|  1|  20|null| 201602|
|  1|  20|3003| 201601|
|  1|  20|null| 201603|
|  2|  40|2321| 201601|
|  2|  30|  10| 201602|
|  2|  61|null| 201601|
+---+----+----+-------+

I need to fill the null values with the mean of the existing values. The expected result is:

+---+----+----+-------+
| id|type|cost|   date|
+---+----+----+-------+
|  0|  10| 223| 201601|
|  0|  10|  83|2016032|
|  1|  20|1128| 201602|
|  1|  20|3003| 201601|
|  1|  20|1128| 201603|
|  2|  40|2321| 201601|
|  2|  30|  10| 201602|
|  2|  61|1128| 201601|
+---+----+----+-------+

where 1128 is the mean of the existing (non-null) cost values. I need to do this for several columns.
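As a quick check of that figure: the mean is taken over the five non-null cost values only, and the three nulls are then replaced by it. A minimal plain-Python sketch (no Spark involved; the variable names are only illustrative):

```python
# cost values copied from the table above; None marks a null
costs = [223, 83, None, 3003, None, 2321, 10, None]

# Step 1: compute the mean over the values that are present.
present = [c for c in costs if c is not None]
mean_cost = sum(present) / len(present)  # 5640 / 5

# Step 2: substitute the mean wherever a value is missing.
filled = [mean_cost if c is None else c for c in costs]

print(mean_cost)  # 1128.0
```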

My current approach is to use na.fill:

fill_values = {column: df_data.agg({column:"mean"}).flatMap(list).collect()[0] for column in df_data.columns if column not in ['date','id']}
df_data = df_data.na.fill(fill_values)

+---+----+----+-------+
| id|type|cost|   date|
+---+----+----+-------+
|  0|  10| 223| 201601|
|  0|  10|  83|2016032|
|  1|  20|1128| 201602|
|  1|  20|3003| 201601|
|  1|  20|1128| 201603|
|  2|  40|2321| 201601|
|  2|  30|  10| 201602|
|  2|  61|1128| 201601|
+---+----+----+-------+

But this is very cumbersome. Any ideas?

Answer 1

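Well, one way or another you have to:

compute the statistics
fill the blanks

That pretty much limits what you can really improve here, but still:

replace flatMap(list).collect()[0] with first()[0] or structure unpacking
compute all the statistics with a single action
use the built-in Row methods to extract a dictionary

The final result could look like this: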

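from pyspark.sql.functions import avg

def fill_with_mean(df, exclude=()):
    # Compute the mean of every non-excluded column in a single aggregation.
    stats = df.agg(*(
        avg(c).alias(c) for c in df.columns if c not in exclude
    ))
    # first() returns a single Row; asDict() turns it into {column: mean}.
    return df.na.fill(stats.first().asDict())

fill_with_mean(df_data, ["id", "date"])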

In Spark 2.2 or later you can also use Imputer. See Replace missing values with mean - Spark Dataframe.

Solution 1
12 Accepted 2016-06-10 15:10:11
