
Spark SQL - replace nulls with default values

I have the following DataFrame schema:

root
 |-- firstname: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- cities: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- postcode: string (nullable = true)

My DataFrame looks like this:

+---------+--------+-----------------------------------+
|firstname|lastname|cities                             |
+---------+--------+-----------------------------------+
|John     |Doe     |[[New York,A000000], [Warsaw,null]]|
|John     |Smith   |[[Berlin,null]]                    |
|John     |null    |[[Paris,null]]                     |
+---------+--------+-----------------------------------+

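I want to replace all null values with the string "unknown". When I use the na.fill function, only the top-level string columns are filled, and I get the following DataFrame: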

df.na.fill("unknown").show()

+---------+--------+-----------------------------------+
|firstname|lastname|cities                             |
+---------+--------+-----------------------------------+
|John     |Doe     |[[New York,A000000], [Warsaw,null]]|
|John     |Smith   |[[Berlin,null]]                    |
|John     |unknown |[[Paris,null]]                     |
+---------+--------+-----------------------------------+

How can I replace all null values in the DataFrame, including those inside the nested array?

na.fill does not fill null values in the struct fields of an array column. One approach is to use a UDF, as shown below:

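import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
// In spark-shell, spark.implicits._ is already in scope. In a standalone
// application, add `import spark.implicits._` for toDF and the $"..." syntax.

case class City(name: String, postcode: String)

val df = Seq(
  ("John", "Doe", Seq(City("New York", "A000000"), City("Warsaw", null))),
  ("John", "Smith", Seq(City("Berlin", null))),
  ("John", null, Seq(City("Paris", null)))
).toDF("firstname", "lastname", "cities")

val defaultStr = "unknown"

// Builds a UDF that replaces a null `name` or `postcode` in every struct of the array.
// This assumes the `cities` array and its elements are themselves non-null.
// Because the UDF returns tuples, the struct fields of the result are named _1 and _2.
def patchNull(default: String) = udf( (s: Seq[Row]) =>
  s.map( r => (r.getAs[String]("name"), r.getAs[String]("postcode")) match {
      case (null, null) => (default, default)
      case (c, null) => (c, default)
      case (null, p) => (default, p)
      case e => e
    }
  ) )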

df.
  withColumn( "cities", patchNull(defaultStr)($"cities") ).
  na.fill(defaultStr).
  show(false)
// +---------+--------+--------------------------------------+
// |firstname|lastname|cities                                |
// +---------+--------+--------------------------------------+
// |John     |Doe     |[[New York,A000000], [Warsaw,unknown]]|
// |John     |Smith   |[[Berlin,unknown]]                    |
// |John     |unknown |[[Paris,unknown]]                     |
// +---------+--------+--------------------------------------+
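As a possible alternative that avoids the UDF, the same replacement can be sketched with the `transform` higher-order SQL function, assuming Spark 2.4 or later. The object name `FillNestedNulls` and the helper `patchNulls` are made up for this illustration, and the default value is hard-coded in the SQL expression. Rebuilding each element with `named_struct` keeps the original field names `name` and `postcode`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

// Hypothetical wrapper object, used here only to make the sketch self-contained.
object FillNestedNulls {
  case class City(name: String, postcode: String)

  // Rebuilds every element of `cities`, replacing null fields with 'unknown',
  // then fills the remaining top-level string columns with na.fill.
  // Assumption: Spark 2.4+ (the `transform` SQL function is not available earlier).
  def patchNulls(df: DataFrame): DataFrame =
    df.withColumn("cities", expr(
        """transform(cities, c -> named_struct(
          |  'name', coalesce(c.name, 'unknown'),
          |  'postcode', coalesce(c.postcode, 'unknown')))""".stripMargin))
      .na.fill("unknown")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-nested-nulls")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("John", "Doe", Seq(City("New York", "A000000"), City("Warsaw", null))),
      ("John", "Smith", Seq(City("Berlin", null))),
      ("John", null, Seq(City("Paris", null)))
    ).toDF("firstname", "lastname", "cities")

    patchNulls(df).show(false)
    spark.stop()
  }
}
```

With this approach a null `cities` array stays null, while a null element inside the array becomes a struct with both fields set to "unknown".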
