Get distinct elements from ArrayType rows in a Spark DataFrame column

Question

I have a dataframe with the following schema:

    root
     |-- e: array (nullable = true)
     |    |-- element: string (containsNull = true)

For example, create a dataframe:

val df = Seq(Seq("73","73"), null, null, null, Seq("51"), null, null, null, Seq("52", "53", "53", "73", "84"), Seq("73", "72", "51", "73")).toDF("e")

df.show()

+--------------------+
|                   e|
+--------------------+
|            [73, 73]|
|                null|
|                null|
|                null|
|                [51]|
|                null|
|                null|
|                null|
|[52, 53, 53, 73, 84]|
|    [73, 72, 51, 73]|
+--------------------+

I'd like the output to be:

+--------------------+
|                   e|
+--------------------+
|                [73]|
|                null|
|                null|
|                null|
|                [51]|
|                null|
|                null|
|                null|
|    [52, 53, 73, 84]|
|        [73, 72, 51]|
+--------------------+

I am trying the following UDF:

def distinct(arr: TraversableOnce[String])=arr.toList.distinct
val distinctUDF=udf(distinct(_:Traversable[String]))

But it only works when the rows aren't null, i.e.

df.filter($"e".isNotNull).select(distinctUDF($"e"))

gives me

+----------------+
|          UDF(e)|
+----------------+
|            [73]|
|            [51]|
|[52, 53, 73, 84]|
|    [73, 72, 51]|
+----------------+

but

df.select(distinctUDF($"e"))

fails. How do I make the UDF handle null in this case? Alternatively, if there's a simpler way of getting the unique values, I'd like to try that.

Answer 1

You can use when().otherwise() to apply your UDF only when the column value is not null. In this case, .otherwise(null) can be omitted, because when() yields null for rows that match no condition if no otherwise clause is specified.

val distinctUDF = udf( (s: Seq[String]) => s.distinct )

df.select(when($"e".isNotNull, distinctUDF($"e")).as("e"))
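
Another option, sketched here as an assumption rather than as part of the accepted answer, is to make the function itself tolerate null so that no when() guard is needed. The names NullSafeDistinct and distinctOrNull below are hypothetical. The block uses plain Scala collections only, so it runs without a Spark session:

```scala
object NullSafeDistinct {
  // Hypothetical helper (not from the original answer): wraps the possibly-null
  // array in Option so that a null row stays null instead of throwing.
  def distinctOrNull(s: Seq[String]): Seq[String] =
    Option(s).map(_.distinct).orNull

  def main(args: Array[String]): Unit = {
    // Sample rows taken from the question, including a null row.
    val rows: Seq[Seq[String]] =
      Seq(Seq("73", "73"), null, Seq("52", "53", "53", "73", "84"))
    // Apply the helper to every row and print each result.
    rows.map(distinctOrNull).foreach(println)
  }
}
```

Wrapped as udf((s: Seq[String]) => NullSafeDistinct.distinctOrNull(s)), the helper should be usable directly in df.select(...) on the nullable column. If you are on Spark 2.4 or later, the built-in array_distinct function in org.apache.spark.sql.functions is probably the simpler route, since it needs no UDF and returns null for a null array.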

Solution 1: accepted, 2018-09-13 23:38:31