
How to replace empty values in a column of DataFrame?

How can I replace empty values in the column Field1 of a DataFrame df?

Field1 Field2
       AA
12     BB

This command does not produce the expected result:

df.na.fill("Field1",Seq("Anonymous"))

The expected result:

Field1          Field2
Anonymous       AA
12              BB

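You can also try this, which should handle blank/empty strings as well as nulls: first replace empty strings with null, then fill every null with the default value.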

df.show()
+------+------+
|Field1|Field2|
+------+------+
|      |    AA|
|    12|    BB|
|    12|  null|
+------+------+

df.na.replace(Seq("Field1","Field2"),Map(""-> null)).na.fill("Anonymous", Seq("Field2","Field1")).show(false)   

+---------+---------+
|Field1   |Field2   |
+---------+---------+
|Anonymous|AA       |
|12       |BB       |
|12       |Anonymous|
+---------+---------+   

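From the documentation of fill: "Returns a new DataFrame that replaces null or NaN values in numeric columns with value."

Two things:

  1. An empty string is neither null nor NaN, so fill does not touch it; you'll have to use a case statement (when/otherwise) for that.
  2. fill with a text value only applies to string columns; a numeric column named in the column list is ignored, so text cannot be filled into a numeric column.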

Failing Example (Filling a Null in a Numeric Column with Text):

scala> a.show
+----+---+
|  f1| f2|
+----+---+
|null| AA|
|  12| BB|
+----+---+

scala> a.na.fill("Anonymous", Seq("f1")).show
+----+---+
|  f1| f2|
+----+---+
|null| AA|
|  12| BB|
+----+---+
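
One possible way around the failing case above, sketched here under the assumption that a string representation of f1 is acceptable downstream, is to cast the numeric column to string before calling fill. The DataFrame below is a hypothetical reconstruction of `a`, and the local SparkSession setup is only there to make the sketch self-contained (spark-shell already provides `spark`):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("fill-cast-sketch").getOrCreate()
import spark.implicits._

// f1 is a nullable integer column, mirroring the `a` DataFrame above (assumed schema)
val a = Seq[(Option[Int], String)]((None, "AA"), (Some(12), "BB")).toDF("f1", "f2")

// fill with a String value skips f1 because it is not a string column
val unchanged = a.na.fill("Anonymous", Seq("f1"))

// After casting f1 to string, the same call replaces the null
val filled = a
  .withColumn("f1", col("f1").cast("string"))
  .na.fill("Anonymous", Seq("f1"))

filled.show()
```

The trade-off is that f1 is no longer numeric after the cast, so this only fits cases where the column is used for display or written out as text.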

Working Example (Filling a Null in a Numeric Column with a Number):

scala> a.show
+----+---+
|  f1| f2|
+----+---+
|null| AA|
|  12| BB|
+----+---+


scala> a.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
|  1| AA|
| 12| BB|
+---+---+

Failing Example (Empty String instead of Null):

scala> b.show
+---+---+
| f1| f2|
+---+---+
|   | AA|
| 12| BB|
+---+---+


scala> b.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
|   | AA|
| 12| BB|
+---+---+

Case Statement Fix Example (when/otherwise):

scala> b.show
+---+---+
| f1| f2|
+---+---+
|   | AA|
| 12| BB|
+---+---+


scala> b.select(when(col("f1") === "", "Anonymous").otherwise(col("f1")).as("f1"), col("f2")).show
+---------+---+
|       f1| f2|
+---------+---+
|Anonymous| AA|
|       12| BB|
+---------+---+
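
The case-statement approach can be extended to cover null and whitespace-only values in the same pass. The following is a minimal sketch, not a definitive implementation: the sample rows are made up for illustration, and it assumes Field1 is a string column. It uses withColumn rather than select so the remaining columns do not have to be listed by hand:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, trim, when}

// Local session for illustration only; spark-shell already provides `spark`
val spark = SparkSession.builder().master("local[*]").appName("blank-to-default-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample: empty string, regular value, null, and whitespace-only
val df = Seq[(String, String)](("", "AA"), ("12", "BB"), (null, "CC"), ("  ", "DD"))
  .toDF("Field1", "Field2")

// True when Field1 is null, empty, or contains only spaces
val isBlank = col("Field1").isNull || trim(col("Field1")) === ""

val result = df.withColumn(
  "Field1",
  when(isBlank, lit("Anonymous")).otherwise(col("Field1"))
)

result.show()
```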

If the DataFrame has many columns, you can apply the replacement to all of them at once, as in the code below.

Note: formats such as Parquet do not support columns of the null data type, so a column created from a bare null has to be cast to a concrete type before writing.

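import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val df = Seq(
  (1, ""),
  (2, "Ram"),
  (3, "Sam"),
  (4, "")
).toDF("ID", "Name")

// Column of nulls, cast to StringType so that it does not have the null data type
val inputDf = df.withColumn("NulType", lit(null).cast(StringType))

// Output of inputDf.show()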

+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
|  1|    |   null|
|  2| Ram|   null|
|  3| Sam|   null|
|  4|    |   null|
+---+----+-------+

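// Replace every empty string in the DataFrame.
// Note: Map("" -> "null") writes the four-character string "null", not a SQL NULL;
// show() prints both the same way.

val colName = inputDf.columns // array of all column names

val data = inputDf.na.replace(colName, Map("" -> "null"))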

data.show()
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
|  1|null|   null|
|  2| Ram|   null|
|  3| Sam|   null|
|  4|null|   null|
+---+----+-------+
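
Because the replace call above substitutes the string "null" rather than an actual null, later checks such as isNull or na.fill will not see those cells as missing. A minimal sketch of one alternative, assuming real nulls are wanted: fold over the string columns and rewrite each with when/otherwise. The name `cleaned` and the local SparkSession are illustrative, not part of the original answer:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}
import org.apache.spark.sql.types.StringType

// Local session for illustration only; spark-shell already provides `spark`
val spark = SparkSession.builder().master("local[*]").appName("empty-to-null-sketch").getOrCreate()
import spark.implicits._

val inputDf = Seq((1, ""), (2, "Ram"), (3, "Sam"), (4, "")).toDF("ID", "Name")

// Only string columns can contain "", so restrict the rewrite to them
val stringCols = inputDf.schema.fields.filter(_.dataType == StringType).map(_.name)

// Rewrite each string column in turn: "" becomes a typed null, other values are kept
val cleaned = stringCols.foldLeft(inputDf) { (acc, c) =>
  acc.withColumn(c, when(col(c) === "", lit(null).cast(StringType)).otherwise(col(c)))
}

cleaned.show()
```

The resulting nulls can then be filled with a default through na.fill, as in the earlier answers.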


 