How to update a Spark DataFrame column containing an array using a UDF

Question

I have a DataFrame:

+--------------------+------+
|people              |person|
+--------------------+------+
|[[jack, jill, hero]]|joker |
+--------------------+------+

Its schema:

root
 |-- people: struct (nullable = true)
 |    |-- person: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |-- person: string (nullable = true)

Here, root--person is a string, so I can update this field with a UDF:

def updateString = udf((s: String) => {
    "Mr. " + s
})

df.withColumn("person", updateString(col("person"))).select("person").show(false)

Output:

+---------+
|person   |
+---------+
|Mr. joker|
+---------+

I want to do the same for the root--people--person column, which contains an array of people. How can I achieve this with a UDF?

def updateArray = udf((arr: Seq[Row]) => ???)

df.withColumn("people", updateArray(col("people.person"))).select("people").show(false)

Expected:

+------------------------------+
|people                        |
+------------------------------+
|[Mr. hero, Mr. jack, Mr. jill]|
+------------------------------+

Edit: I also want to preserve the schema of root--people--person after updating it.

Expected schema of people:

df.select("people").printSchema()

root
 |-- people: struct (nullable = false)
 |    |-- person: array (nullable = true)
 |    |    |-- element: string (containsNull = true)

Thanks,

Answer 1

You only need to change your function; everything else stays the same. Here is a code snippet.

scala> df2.show
+------+------------------+
|people|            person|
+------+------------------+
| joker|[jack, jill, hero]|
+------+------------------+
// Only the column order is changed.
I updated your function to take Seq[String] instead of Seq[Row]:

scala> def updateArray = udf((arr: Seq[String]) => arr.map(x=>"Mr."+x))
scala> df2.withColumn("test",updateArray($"person")).show(false)
+------+------------------+---------------------------+
|people|person            |test                       |
+------+------------------+---------------------------+
|joker |[jack, jill, hero]|[Mr.jack, Mr.jill, Mr.hero]|
+------+------------------+---------------------------+
// All columns are kept here for testing; drop the ones you don't need.

Let me know if you need more information.

Answer 2

The problem here is that people is a struct with only one field. In your UDF you need to return a Tuple1, and then cast the UDF's output so that the field name stays correct:

def updateArray = udf((r: Row) => Tuple1(r.getAs[Seq[String]](0).map(x=>"Mr."+x)))

val newDF = df
  .withColumn("people",updateArray($"people").cast("struct<person:array<string>>"))

newDF.printSchema()
newDF.show()

This gives:

root
 |-- people: struct (nullable = true)
 |    |-- person: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |-- person: string (nullable = true)


+--------------------+------+
|              people|person|
+--------------------+------+
|[[Mr.jack, Mr.jil...| joker|
+--------------------+------+
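
The schema in the question marks both the array and its elements as nullable, and calling map on a null array inside a UDF would throw a NullPointerException. As a minimal sketch, assuming you want nulls to pass through unchanged, the element-wise logic could be pulled into a plain function. NullSafePrefix and addPrefix are hypothetical names used only for illustration, and the "Mr. " prefix (with a space) follows the expected output in the question:

```scala
// Hypothetical helper, not part of Spark: the pure logic that a UDF body could call.
object NullSafePrefix {
  // Returns null for a null array and leaves null elements untouched;
  // every other element gets the prefix prepended.
  def addPrefix(names: Seq[String], prefix: String = "Mr. "): Seq[String] =
    Option(names)
      .map(_.map(n => if (n == null) null else prefix + n))
      .orNull
}
```

Inside the UDF above, the call `r.getAs[Seq[String]](0).map(...)` could then become `NullSafePrefix.addPrefix(r.getAs[Seq[String]](0))`. A null people struct (a null Row) would still need its own check.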

Answer 3

Let's create some test data:

scala> val data = Seq((List(Array("ja", "ji", "he")), "person")).toDF("people", "person")
data: org.apache.spark.sql.DataFrame = [people: array<array<string>>, person: string]

scala> data.printSchema
root
 |-- people: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- person: string (nullable = true)

Create the UDF according to our requirements:

scala> def arrayConcat(array:Seq[Seq[String]], str: String) = array.map(_.map(str + _))
arrayConcat: (array: Seq[Seq[String]], str: String)Seq[Seq[String]]

scala> val arrayConcatUDF = udf(arrayConcat _)
arrayConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true), StringType)))

Apply the UDF:

scala> data.withColumn("dasd", arrayConcatUDF($"people", lit("Mr."))).show(false)
+--------------------------+------+-----------------------------------+
|people                    |person|dasd                               |
+--------------------------+------+-----------------------------------+
|[WrappedArray(ja, ji, he)]|person|[WrappedArray(Mr.ja, Mr.ji, Mr.he)]|
+--------------------------+------+-----------------------------------+

You may need to tweak this slightly (I think hardly any tweaking is needed), but it contains most of what you need to solve your problem.

Answer 4

If you are using Spark >= 2.4.0, you don't need a UDF. You can use transform and concat as follows:

import org.apache.spark.sql.functions.expr

val df = Seq(
  (Seq("jack", "jill", "hero"), "joker")
).toDF("people", "person")

df.select(
  expr("transform(people, x -> concat('Mr.', x))").as("people"), $"person"
).show(false)

// +---------------------------+------+
// |people                     |person|
// +---------------------------+------+
// |[Mr.jack, Mr.jill, Mr.hero]|joker |
// +---------------------------+------+
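
The example above uses a flat array column, whereas in the question people is a struct wrapping the array. A possible adaptation, sketched under the assumption that Spark >= 2.4 is on the classpath, applies transform to people.person and rebuilds the one-field struct with struct(...). The People case class, the TransformNestedSketch object and prefixPeople are hypothetical names introduced only to recreate the question's schema:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{expr, struct}

// Hypothetical case class, used only to recreate the question's nested schema.
case class People(person: Seq[String])

object TransformNestedSketch {
  // Prefix every element of people.person, then wrap the result back into
  // a struct with a single field named "person".
  def prefixPeople(df: DataFrame): DataFrame =
    df.withColumn(
      "people",
      struct(expr("transform(people.person, x -> concat('Mr. ', x))").as("person"))
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("transform-nested-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((People(Seq("jack", "jill", "hero")), "joker")).toDF("people", "person")
    val updated = prefixPeople(df)
    updated.printSchema()
    updated.show(false)
    spark.stop()
  }
}
```

Because the struct is rebuilt from a named column, no cast is needed to restore the field name. It is worth checking the printSchema output against the schema you expect, since nullability flags may differ from the original column.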

Answer 1 | 1 | 2019-10-29 07:26:58
Answer 2 | 1 | accepted | 2019-10-29 12:51:52
Answer 3 | 0 | 2019-10-29 07:15:23
Answer 4 | 0 | 2019-10-29 17:39:06
