Remove Null from Array Columns in Dataframe in Scala with Spark (1.6)

I have a dataframe with a key column and a column which has an array of structs. The schema looks like below.

root
 |-- id: string (nullable = true)
 |-- desc: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- age: long (nullable = false)

The array "desc" can have any number of null values. I would like to create a final dataframe where the array contains none of the null values, using Spark 1.6:

An example would be:

Key     Value
1010    [[George,21],null,[MARIE,13],null]
1023    [null,[Watson,11],[John,35],null,[Kyle,33]]

I want the final dataframe as:

Key     Value
1010    [[George,21],[MARIE,13]]
1023    [[Watson,11],[John,35],[Kyle,33]]

I tried doing this with a UDF and a case class but got

java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to....
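
For reference, this is roughly the shape of what I tried (a sketch; the names are illustrative, not my exact code). Declaring the UDF parameter as a Seq of the case class compiles, but Spark actually passes each struct element into the UDF as a Row, so accessing a field triggers the failing cast:

import org.apache.spark.sql.functions.udf

case class Person(name: String, age: Long)

// Compiles, but fails at runtime with the ClassCastException above:
// Spark hands each struct element to the UDF as a
// GenericRowWithSchema, and p.name forces a cast to Person.
val removeNulls = udf((xs: Seq[Person]) =>
  xs.filter(_ != null).map(p => Person(p.name, p.age)))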

Any help is greatly appreciated; I would prefer to do it without converting to RDDs if possible. Also, I am new to Spark and Scala, so thanks in advance!

Here is another version:

case class Person(name: String, age: Int)

root
 |-- id: string (nullable = true)
 |-- desc: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- age: integer (nullable = false)

+----+-----------------------------------------------+
|id  |desc                                           |
+----+-----------------------------------------------+
|1010|[[George,21], null, [MARIE,13], null]          |
|1023|[[Watson,11], null, [John,35], null, [Kyle,33]]|
+----+-----------------------------------------------+
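
(For reference, a sketch of how this sample dataframe can be built, assuming a sqlContext with its implicits imported and the Person case class above:)

import sqlContext.implicits._

// null elements stand in for the null structs in the array
val df = Seq(
  ("1010", Seq(Person("George", 21), null, Person("MARIE", 13), null)),
  ("1023", Seq(Person("Watson", 11), null, Person("John", 35), null, Person("Kyle", 33)))
).toDF("id", "desc")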


import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val filterOutNull = udf((xs: Seq[Row]) => {
  xs.flatMap {
    case null => Nil
    // convert the Row back to your specific struct:
    case Row(s: String, i: Int) => List(Person(s, i))
  }
})

val result = df.withColumn("filteredListDesc", filterOutNull($"desc"))

+----+-----------------------------------------------+-----------------------------------+
|id  |desc                                           |filteredListDesc                   |
+----+-----------------------------------------------+-----------------------------------+
|1010|[[George,21], null, [MARIE,13], null]          |[[George,21], [MARIE,13]]          |
|1023|[[Watson,11], null, [John,35], null, [Kyle,33]]|[[Watson,11], [John,35], [Kyle,33]]|
+----+-----------------------------------------------+-----------------------------------+
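
This works because udf derives the return schema from the function's return type: the returned Seq[Person] is mapped back to an array of structs whose fields match the case class.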

Given that the original dataframe has the following schema

root
 |-- id: string (nullable = true)
 |-- desc: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- age: long (nullable = false)

Defining a udf function to remove the null values from the array should work for you

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

def removeNull = udf((array: Seq[Row]) =>
  array.filterNot(_ == null)
       .map(x => element(x.getAs[String]("name"), x.getAs[Long]("age"))))

df.withColumn("desc", removeNull(col("desc")))

where element is a case class

case class element(name: String, age: Long)

and you should get

+----+-----------------------------------+
|id  |desc                               |
+----+-----------------------------------+
|1010|[[George,21], [MARIE,13]]          |
|1023|[[Watson,11], [John,35], [Kyle,33]]|
+----+-----------------------------------+
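
Note that getAs[Long]("age") has to match the type in the schema (long here); asking for getAs[Int] on a long column would fail at runtime with a ClassCastException much like the one in the question.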
