
Scala: Delete empty array values from a Spark DataFrame

I'm new to Scala. Given a DataFrame named df as follows:

+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [null]|  [0.0]|  [0.0]| [null]|
| [IND1]|  [5.0]|  [6.0]|    [A]|
| [IND2]|  [7.0]|  [8.0]|    [B]|
|     []|     []|     []|     []|
+-------+-------+-------+-------+

I'd like to delete rows where every column is an empty array (the 4th row).

For example, I'd expect the result to be:

+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [null]|  [0.0]|  [0.0]| [null]|
| [IND1]|  [5.0]|  [6.0]|    [A]|
| [IND2]|  [7.0]|  [8.0]|    [B]|
+-------+-------+-------+-------+

I'm trying to use isNotNull (e.g. val temp=df.filter(col("Column1").isNotNull && col("Column2").isNotNull && col("Column3").isNotNull && col("Column4").isNotNull).show() ), but it still shows all rows.

I found a Python solution using a Hive UDF from link, but I had a hard time converting it to valid Scala code. I would like to use a Scala command similar to the following code:

val query = "SELECT * FROM targetDf WHERE {0}".format(" AND ".join("SIZE({0}) > 0".format(c) for c in ["Column1", "Column2", "Column3","Column4"]))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.sql(query)
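
For reference, a rough Scala rendering of that query-building idea might be (a sketch, assuming Spark 2.x and that df is the source DataFrame):

val columns = Seq("Column1", "Column2", "Column3", "Column4")
// e.g. "SIZE(Column1) > 0 AND SIZE(Column2) > 0 AND ..."
val predicate = columns.map(c => s"SIZE($c) > 0").mkString(" AND ")
val query = s"SELECT * FROM targetDf WHERE $predicate"

df.createOrReplaceTempView("targetDf") // make df visible to SQL
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.sql(query).show()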

Any help would be appreciated. Thank you.

Using isNotNull or isNull will not work because they look for a null value in the DataFrame. Your example DF does not contain null values but empty arrays; there is a difference.

One option: you could create new columns holding the length of each array and filter out the rows where that length is zero.

  import org.apache.spark.sql.functions.size
  import spark.implicits._ // enables the $"columnName" syntax

  val dfFil = df
    .withColumn("arrayLengthColOne", size($"Column1"))
    .withColumn("arrayLengthColTwo", size($"Column2"))
    .withColumn("arrayLengthColThree", size($"Column3"))
    .withColumn("arrayLengthColFour", size($"Column4"))
    // keep only rows where no column holds an empty array
    .filter($"arrayLengthColOne" =!= 0 && $"arrayLengthColTwo" =!= 0
      && $"arrayLengthColThree" =!= 0 && $"arrayLengthColFour" =!= 0)
    // drop the helper columns again
    .drop("arrayLengthColOne", "arrayLengthColTwo", "arrayLengthColThree", "arrayLengthColFour")

Original DF:

+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
|    [A]|    [B]|    [C]|    [d]|
|     []|     []|     []|     []|
+-------+-------+-------+-------+

New DF:

+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
|    [A]|    [B]|    [C]|    [d]|
+-------+-------+-------+-------+

You could also create a function that maps across all the columns and does this generically, as sketched below.
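
A minimal sketch of that idea (assuming every column of df holds an array type; the name allNonEmpty is just for illustration):

import org.apache.spark.sql.functions.{col, size}

// build a size-not-zero predicate per column and AND them all together
val allNonEmpty = df.columns
  .map(c => size(col(c)) =!= 0)
  .reduce(_ && _)

val dfFil = df.filter(allNonEmpty)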

Another approach (in addition to the accepted answer) would be to use Datasets.
For example, with a case class:

case class MyClass(col1: Seq[String],
                   col2: Seq[Double],
                   col3: Seq[Double],
                   col4: Seq[String]) {
    // a row counts as empty when every column holds an empty array
    def isEmpty: Boolean = col1.isEmpty && col2.isEmpty && col3.isEmpty && col4.isEmpty
}

You can represent your source as a typed structure:

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._ // needed to provide an implicit encoder/data mapper

val originalSource: DataFrame = ... // provide your source
val source: Dataset[MyClass] = originalSource.as[MyClass] // convert/map it to Dataset

So you could do filtering like the following:

source.filter(element => !element.isEmpty) // calling class's instance method
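
With Scala's placeholder syntax the same filter can be shortened to source.filter(!_.isEmpty).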
