Scala: Delete empty array values from a Spark DataFrame
I'm new to Scala. Given a DataFrame named df as follows:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [null]| [0.0]| [0.0]| [null]|
| [IND1]| [5.0]| [6.0]| [A]|
| [IND2]| [7.0]| [8.0]| [B]|
| []| []| []| []|
+-------+-------+-------+-------+
I'd like to delete rows where all columns are empty arrays (the 4th row). For example, I expect the result to be:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [null]| [0.0]| [0.0]| [null]|
| [IND1]| [5.0]| [6.0]| [A]|
| [IND2]| [7.0]| [8.0]| [B]|
+-------+-------+-------+-------+
I'm trying to use isNotNull, like this:
val temp = df.filter(col("Column1").isNotNull && col("Column2").isNotNull && col("Column3").isNotNull && col("Column4").isNotNull).show()
but it still shows all rows.
I found a Python solution using a Hive UDF from this link, but I had a hard time converting it to valid Scala code. I'd like to use a Scala command similar to the following:
val query = "SELECT * FROM targetDf WHERE {0}".format(" AND ".join("SIZE({0}) > 0".format(c) for c in ["Column1", "Column2", "Column3","Column4"]))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.sql(query)
Any help would be appreciated. Thank you.
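For reference, the Python-style string formatting in the question has a fairly direct Scala equivalent. This is only a sketch: it assumes targetDf has been registered as a temporary view, and the SparkSession call is left commented out so the string-building part stands on its own.

```scala
// Sketch: building the same SQL string in plain Scala.
// Column names are taken from the question; `targetDf` is assumed
// to be registered as a temporary view before running the query.
val cols = Seq("Column1", "Column2", "Column3", "Column4")
val predicate = cols.map(c => s"SIZE($c) > 0").mkString(" AND ")
val query = s"SELECT * FROM targetDf WHERE $predicate"
// spark.sql(query)  // uncomment once a SparkSession named `spark` is in scope
```

Seq.map plus mkString replaces Python's generator expression and str.join; the SQL SIZE function itself is the same on both sides.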
Using isNotNull or isNull will not work because they look for null values in the DataFrame. Your example DF does not contain null values but empty arrays, and there is a difference.
One option: you could create a new column holding the length of each array and filter out rows where that length is zero.
val dfFil = df
.withColumn("arrayLengthColOne", size($"Column1"))
.withColumn("arrayLengthColTwo", size($"Column2"))
.withColumn("arrayLengthColThree", size($"Column3"))
.withColumn("arrayLengthColFour", size($"Column4"))
.filter($"arrayLengthColOne" =!= 0 && $"arrayLengthColTwo" =!= 0
&& $"arrayLengthColThree" =!= 0 && $"arrayLengthColFour" =!= 0)
.drop("arrayLengthColOne", "arrayLengthColTwo", "arrayLengthColThree", "arrayLengthColFour")
Original DF:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [A]| [B]| [C]| [d]|
| []| []| []| []|
+-------+-------+-------+-------+
New DF:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [A]| [B]| [C]| [d]|
+-------+-------+-------+-------+
You could also create a function that maps across all the columns and does this generically.
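A minimal sketch of that idea, with invented two-column sample data: build one predicate covering every column of the DataFrame, so the column list never has to be written out by hand.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, size}

val spark = SparkSession.builder().master("local[1]").appName("filter-empty").getOrCreate()
import spark.implicits._

// Illustrative data: one row with values, one row of empty arrays.
val df = Seq(
  (Seq("IND1"), Seq(5.0)),
  (Seq.empty[String], Seq.empty[Double])
).toDF("Column1", "Column2")

// Combine a size(...) > 0 check for every column into one predicate.
val allNonEmpty = df.columns.map(c => size(col(c)) > 0).reduce(_ && _)
val dfFil = df.filter(allNonEmpty)
```

As a side effect, `size(...) > 0` also drops rows where a column is null rather than an empty array, since size of a null array defaults to -1 in Spark.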
Another approach (in addition to the accepted answer) would be to use Datasets. For example, with a case class:
case class MyClass(col1: Seq[String],
col2: Seq[Double],
col3: Seq[Double],
col4: Seq[String]) {
def isEmpty: Boolean = col1.isEmpty && col2.isEmpty && col3.isEmpty && col4.isEmpty
}
You can represent your source as a typed structure:
import spark.implicits._ // needed to provide an implicit encoder/data mapper
val originalSource: DataFrame = ... // provide your source
val source: Dataset[MyClass] = originalSource.as[MyClass] // convert/map it to Dataset
Then you could filter like this:
source.filter(element => !element.isEmpty) // calling class's instance method
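Putting the pieces together, here is a self-contained sketch of the Dataset approach. The sample data is invented for illustration, and the DataFrame columns are named to match the case class fields so that .as[MyClass] can map them.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Field names must match the source DataFrame's column names.
case class MyClass(col1: Seq[String],
                   col2: Seq[Double],
                   col3: Seq[Double],
                   col4: Seq[String]) {
  // A row is "empty" when every array column is empty.
  def isEmpty: Boolean = col1.isEmpty && col2.isEmpty && col3.isEmpty && col4.isEmpty
}

val spark = SparkSession.builder().master("local[1]").appName("dataset-filter").getOrCreate()
import spark.implicits._

// Invented sample data mirroring the question: one real row, one all-empty row.
val originalSource: DataFrame = Seq(
  (Seq("IND1"), Seq(5.0), Seq(6.0), Seq("A")),
  (Seq.empty[String], Seq.empty[Double], Seq.empty[Double], Seq.empty[String])
).toDF("col1", "col2", "col3", "col4")

val source: Dataset[MyClass] = originalSource.as[MyClass]
val filtered = source.filter(element => !element.isEmpty)
```

The filter runs as a typed lambda over MyClass instances, which trades some Catalyst optimization for compile-time checked field access.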