Get distinct elements from rows of type ArrayType in Spark dataframe column

I have a dataframe with the following schema:

    root
     |-- e: array (nullable = true)
     |    |-- element: string (containsNull = true)

For example, create a dataframe:

val df = Seq(Seq("73","73"), null, null, null, Seq("51"), null, null, null, Seq("52", "53", "53", "73", "84"), Seq("73", "72", "51", "73")).toDF("e")

df.show()

+--------------------+
|                   e|
+--------------------+
|            [73, 73]|
|                null|
|                null|
|                null|
|                [51]|
|                null|
|                null|
|                null|
|[52, 53, 53, 73, 84]|
|    [73, 72, 51, 73]|
+--------------------+

I'd like the output to be:

+--------------------+
|                   e|
+--------------------+
|                [73]|
|                null|
|                null|
|                null|
|                [51]|
|                null|
|                null|
|                null|
|    [52, 53, 73, 84]|
|        [73, 72, 51]|
+--------------------+

I am trying the following UDF:

def distinct(arr: TraversableOnce[String]) = arr.toList.distinct
val distinctUDF = udf(distinct(_: Traversable[String]))

But it only works when the rows aren't null, i.e.

df.filter($"e".isNotNull).select(distinctUDF($"e")) 

gives me

+----------------+
|          UDF(e)|
+----------------+
|            [73]|
|            [51]|
|[52, 53, 73, 84]|
|    [73, 72, 51]|
+----------------+

but

df.select(distinctUDF($"e")) 

fails. How do I make the UDF handle null in this case? Alternatively, if there's a simpler way of getting the unique values, I'd like to try that.

You can use when().otherwise() to apply your UDF only when the column value is not null. In this case, .otherwise(null) can also be omitted, because when() returns null for unmatched rows if no otherwise clause is specified.

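import org.apache.spark.sql.functions.{udf, when}

val distinctUDF = udf( (s: Seq[String]) => s.distinct )

// when() without otherwise() returns null for rows where e is null.
df.select(when($"e".isNotNull, distinctUDF($"e")).as("e"))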
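
Another option is to make the function itself null-safe, so it can be applied to the column directly. The following is only a minimal sketch in plain Scala, not something from the original post: the name distinctOrNull is hypothetical, and it assumes Spark passes null to the UDF for null array values, which is what the failure in the question suggests.

```scala
// Hypothetical helper (the name is an assumption for illustration).
// Option(null) is None, so a null array stays null instead of
// throwing a NullPointerException when distinct is called on it.
def distinctOrNull(arr: Seq[String]): Seq[String] =
  Option(arr).map(_.distinct).orNull

// Plain Scala check on a few rows from the question; no Spark session needed.
val rows: Seq[Seq[String]] =
  Seq(Seq("73", "73"), null, Seq("52", "53", "53", "73", "84"))
val result = rows.map(distinctOrNull)

// With Spark in scope, it could presumably be wrapped the same way as above:
// val distinctUDF = udf(distinctOrNull _)
// df.select(distinctUDF($"e").as("e"))
```

If you are on Spark 2.4 or later, the built-in array_distinct function in org.apache.spark.sql.functions removes duplicates from an array column without a UDF, which may be the simpler route.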
