
Update a DataFrame struct column value in Spark Scala

I have a DataFrame with the following schema:

root
 |-- col1: string (nullable = true)
 |-- col2: integer (nullable = true)
 |-- colStruct: struct (nullable = true)
 |    |-- subCol1: integer (nullable = true)
 |    |-- subCol2: string (nullable = true)
 |    |-- subCol3: integer (nullable = true)

How can I update the values of the subCol1 and subCol3 columns using a UDF?

Use . (dot) notation to access nested columns.
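As a quick illustration of dot notation on the schema from the question, here is a minimal sketch. The sample row, the temporary column names `s1`, `s2`, `s3`, and the local `SparkSession` are assumptions made for illustration only; in spark-shell the `spark` value already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("dot-notation-sketch").getOrCreate()
import spark.implicits._

// Made-up sample row, packed into a struct that mirrors the schema in the question.
val df = Seq(("a", 1, 10, "x", 30))
  .toDF("col1", "col2", "s1", "s2", "s3")
  .select(
    col("col1"),
    col("col2"),
    struct(col("s1").as("subCol1"), col("s2").as("subCol2"), col("s3").as("subCol3")).as("colStruct")
  )

// Dot notation reaches inside the struct; each selected column takes the nested field's name.
val nested = df.select(col("colStruct.subCol1"), col("colStruct.subCol3"))
nested.printSchema()
```

Selecting a nested field this way only reads it. Changing a value means rebuilding the struct column.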

Here is a complete example:

Data

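// Needed for .toDF outside spark-shell; assumes a SparkSession named `spark` is in scope
import spark.implicits._

case class Details(height: Integer, weight: Integer, sex: String) // height in cms, weight in lbs
case class Person(name: String, age: Integer, details: Details)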

println("The following is our dataset")
val data = Seq(
  Person("Darth Vader", 80, Details(180, 200, "male")),
  Person("Luke Skywalker", 25, Details(185, 180, "male")),
  Person("Obi-Wan Kenobe", 50, Details(175, 175, "male")),
  Person("Princess Leia", 23, Details(165, 150, "female")),
).toDF.cache()
data.show(5, false)

println("The schema of our data is:")
data.printSchema()


/*
The following is our dataset
+--------------+---+------------------+
|name          |age|details           |
+--------------+---+------------------+
|Darth Vader   |80 |{180, 200, male}  |
|Luke Skywalker|25 |{185, 180, male}  |
|Obi-Wan Kenobe|50 |{175, 175, male}  |
|Princess Leia |23 |{165, 150, female}|
+--------------+---+------------------+

The schema of our data is:
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- details: struct (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- weight: integer (nullable = true)
 |    |-- sex: string (nullable = true)
*/

Updating the nested columns

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql._

// list out the columns you want to update using .(dot) notation
val allNestedColumnNamesToUpdate = Seq("details.height", "details.weight")
// list out all nested columns
val allNestedColumnNames = Seq("height", "weight", "sex")

// create your UDFs. Here we have created one for each integer nested column
val updateHeight = (value: Int) => { if (value < 180) 190 else 170 }
val updateWeight = (value: Int) => { if (value < 180) 190 else 170 }
// register UDFs
val updateHeightUDF = spark.udf.register("updateHeightUDF", updateHeight)
val updateWeightUDF = spark.udf.register("updateWeightUDF", updateWeight)

// Map the name of each nested column to update to its UDF
val columnNameToUpdateToUDFMap = Map(
  "details.height" -> updateHeightUDF,
  "details.weight" -> updateWeightUDF
)

// Rebuild the struct once per column to update, carrying the result forward in `acc`
val updatedDF = allNestedColumnNamesToUpdate.foldLeft(data)((acc, columnNameToUpdate) => {
  val updateUDF = columnNameToUpdateToUDFMap(columnNameToUpdate)
  val updatedStructColumns = allNestedColumnNames.map(x => {
    // compare fully qualified names, and keep the plain field name inside the struct
    if (s"details.$x" == columnNameToUpdate) updateUDF(col(columnNameToUpdate)).as(x)
    else col(s"details.$x")
  })
  acc.withColumn("details", struct(updatedStructColumns: _*))
})

updatedDF.show()
/*
+--------------+---+------------------+
|          name|age|           details|
+--------------+---+------------------+
|   Darth Vader| 80|  {170, 170, male}|
|Luke Skywalker| 25|  {170, 170, male}|
|Obi-Wan Kenobe| 50|  {190, 190, male}|
| Princess Leia| 23|{190, 190, female}|
+--------------+---+------------------+
*/

Note: UDFs are not recommended, because they are opaque to Spark's optimizer.
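If the update rule can be written with built-in column functions, the UDFs can be avoided. The following is a minimal sketch, not part of the original answer: it applies the same "below 180 becomes 190, otherwise 170" rule with `when`/`otherwise`. The helper name `flip`, the two sample rows, and the local `SparkSession` are assumptions for illustration. `Column.withField` requires Spark 3.1 or later.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, struct, when}

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("no-udf-sketch").getOrCreate()
import spark.implicits._

// Two sample rows, packed into a struct that mirrors the `details` column above.
val people = Seq(
  ("Darth Vader", 80, 180, 200, "male"),
  ("Princess Leia", 23, 165, 150, "female")
).toDF("name", "age", "height", "weight", "sex")
  .select(col("name"), col("age"),
    struct(col("height"), col("weight"), col("sex")).as("details"))

// Hypothetical helper: the same rule as the UDFs, visible to the optimizer.
// Unlike the Int-based UDF, a null input falls through to otherwise(170).
def flip(c: Column): Column = when(c < 180, 190).otherwise(170)

// Option 1: rebuild the whole struct, replacing only the fields that change.
val updated = people.withColumn("details", struct(
  flip(col("details.height")).as("height"),
  flip(col("details.weight")).as("weight"),
  col("details.sex").as("sex")
))

// Option 2 (Spark 3.1+): replace individual fields in place with withField.
val viaWithField = people.withColumn("details",
  col("details")
    .withField("height", flip(col("details.height")))
    .withField("weight", flip(col("details.weight"))))

updated.show(false)
```

Rebuilding the struct works on older Spark versions but requires listing every field. `withField` touches only the named fields, which is easier to maintain for wide structs.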
