Scala: Delete empty array values from a Spark DataFrame
I'm new to Scala. Given a DataFrame named df as follows:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [null]| [0.0]| [0.0]| [null]|
| [IND1]| [5.0]| [6.0]| [A]|
| [IND2]| [7.0]| [8.0]| [B]|
| []| []| []| []|
+-------+-------+-------+-------+
I'd like to delete rows where all columns are empty arrays (the 4th row). For example, I expect the result to be:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [null]| [0.0]| [0.0]| [null]|
| [IND1]| [5.0]| [6.0]| [A]|
| [IND2]| [7.0]| [8.0]| [B]|
+-------+-------+-------+-------+
I'm trying to use isNotNull, like this:
val temp = df.filter(col("Column1").isNotNull && col("Column2").isNotNull && col("Column3").isNotNull && col("Column4").isNotNull).show()
but it still shows all rows.
I found a Python solution using a Hive UDF from this link, but I had a hard time converting it to valid Scala code. I'd like to use a Scala command similar to the following:
val query = "SELECT * FROM targetDf WHERE {0}".format(" AND ".join("SIZE({0}) > 0".format(c) for c in ["Column1", "Column2", "Column3","Column4"]))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.sql(query)
Any help would be appreciated. Thank you.
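For reference, the Python-style string formatting in the question has a fairly direct Scala equivalent. This is only a sketch: it assumes targetDf has been registered as a temporary view, and the SparkSession call is left commented out so the string-building part stands on its own.

```scala
// Sketch: building the same SQL string in plain Scala.
// Column names are taken from the question; `targetDf` is assumed
// to be registered as a temporary view before running the query.
val cols = Seq("Column1", "Column2", "Column3", "Column4")
val predicate = cols.map(c => s"SIZE($c) > 0").mkString(" AND ")
val query = s"SELECT * FROM targetDf WHERE $predicate"
// spark.sql(query)  // uncomment once a SparkSession named `spark` is in scope
```

Seq.map plus mkString replaces Python's generator expression and str.join; the SQL SIZE function itself is the same on both sides.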
Using isNotNull or isNull will not work because they look for null values in the DataFrame. Your example DF does not contain null values but empty arrays, and there is a difference.
One option: you could create a new column holding the length of each array and filter out rows where that length is zero.
val dfFil = df
.withColumn("arrayLengthColOne", size($"Column1"))
.withColumn("arrayLengthColTwo", size($"Column2"))
.withColumn("arrayLengthColThree", size($"Column3"))
.withColumn("arrayLengthColFour", size($"Column4"))
.filter($"arrayLengthColOne" =!= 0 && $"arrayLengthColTwo" =!= 0
&& $"arrayLengthColThree" =!= 0 && $"arrayLengthColFour" =!= 0)
.drop("arrayLengthColOne", "arrayLengthColTwo", "arrayLengthColThree", "arrayLengthColFour")
Original DF:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [A]| [B]| [C]| [d]|
| []| []| []| []|
+-------+-------+-------+-------+
New DF:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [A]| [B]| [C]| [d]|
+-------+-------+-------+-------+
You could also create a function that maps across all the columns and does this generically.
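A minimal sketch of that idea, with invented two-column sample data: build one predicate covering every column of the DataFrame, so the column list never has to be written out by hand.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, size}

val spark = SparkSession.builder().master("local[1]").appName("filter-empty").getOrCreate()
import spark.implicits._

// Illustrative data: one row with values, one row of empty arrays.
val df = Seq(
  (Seq("IND1"), Seq(5.0)),
  (Seq.empty[String], Seq.empty[Double])
).toDF("Column1", "Column2")

// Combine a size(...) > 0 check for every column into one predicate.
val allNonEmpty = df.columns.map(c => size(col(c)) > 0).reduce(_ && _)
val dfFil = df.filter(allNonEmpty)
```

As a side effect, `size(...) > 0` also drops rows where a column is null rather than an empty array, since size of a null array defaults to -1 in Spark.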
Another approach (in addition to the accepted answer) would be to use Datasets. For example, with a case class:
case class MyClass(col1: Seq[String],
col2: Seq[Double],
col3: Seq[Double],
col4: Seq[String]) {
def isEmpty: Boolean = col1.isEmpty && col2.isEmpty && col3.isEmpty && col4.isEmpty
}
You can represent your source as a typed structure:
import spark.implicits._ // needed to provide an implicit encoder/data mapper
val originalSource: DataFrame = ... // provide your source
val source: Dataset[MyClass] = originalSource.as[MyClass] // convert/map it to Dataset
Then you could filter like this:
source.filter(element => !element.isEmpty) // calling class's instance method
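Putting the pieces together, here is a self-contained sketch of the Dataset approach. The sample data is invented for illustration, and the DataFrame columns are named to match the case class fields so that .as[MyClass] can map them.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Field names must match the source DataFrame's column names.
case class MyClass(col1: Seq[String],
                   col2: Seq[Double],
                   col3: Seq[Double],
                   col4: Seq[String]) {
  // A row is "empty" when every array column is empty.
  def isEmpty: Boolean = col1.isEmpty && col2.isEmpty && col3.isEmpty && col4.isEmpty
}

val spark = SparkSession.builder().master("local[1]").appName("dataset-filter").getOrCreate()
import spark.implicits._

// Invented sample data mirroring the question: one real row, one all-empty row.
val originalSource: DataFrame = Seq(
  (Seq("IND1"), Seq(5.0), Seq(6.0), Seq("A")),
  (Seq.empty[String], Seq.empty[Double], Seq.empty[Double], Seq.empty[String])
).toDF("col1", "col2", "col3", "col4")

val source: Dataset[MyClass] = originalSource.as[MyClass]
val filtered = source.filter(element => !element.isEmpty)
```

The filter runs as a typed lambda over MyClass instances, which trades some Catalyst optimization for compile-time checked field access.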