Update a DataFrame struct column value in Spark Scala
I have a DataFrame with the following schema:
root
|-- col1: string (nullable = true)
|-- col2: integer (nullable = true)
|-- colStruct: struct (nullable = true)
| |-- subCol1: integer (nullable = true)
| |-- subCol2: string (nullable = true)
|-- subCol3: integer (nullable = true)
How can I update the values of the subCol1 and subCol3 columns using a UDF?
Use . (dot) notation to access nested columns.
Here is an example:
// Needed for .toDF outside spark-shell (spark-shell imports it automatically)
import spark.implicits._
case class Details(height: Integer, weight: Integer, sex: String) // height in cms, weight in lbs
case class Person(name: String, age: Integer, details: Details)
println("The following is our dataset")
val data = Seq(
  Person("Darth Vader", 80, Details(180, 200, "male")),
  Person("Luke Skywalker", 25, Details(185, 180, "male")),
  Person("Obi-Wan Kenobe", 50, Details(175, 175, "male")),
  Person("Princess Leia", 23, Details(165, 150, "female"))
).toDF.cache()
data.show(5, false)
println("The schema of our data is:")
data.printSchema()
/*
The following is our dataset
+--------------+---+------------------+
|name |age|details |
+--------------+---+------------------+
|Darth Vader |80 |{180, 200, male} |
|Luke Skywalker|25 |{185, 180, male} |
|Obi-Wan Kenobe|50 |{175, 175, male} |
|Princess Leia |23 |{165, 150, female}|
+--------------+---+------------------+
The schema of our data is:
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- details: struct (nullable = true)
| |-- height: integer (nullable = true)
| |-- weight: integer (nullable = true)
| |-- sex: string (nullable = true)
*/
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
// list out the columns you want to update using .(dot) notation
val allNestedColumnNamesToUpdate = Seq("details.height", "details.weight")
// list out all nested columns
val allNestedColumnNames = Seq("height", "weight", "sex")
// create your UDFs. Here we have created one for each integer nested column
val updateHeight = (value: Int) => { if (value < 180) 190 else 170 }
val updateWeight = (value: Int) => { if (value < 180) 190 else 170 }
// register UDFs
val updateHeightUDF = spark.udf.register("updateHeightUDF", updateHeight)
val updateWeightUDF = spark.udf.register("updateWeightUDF", updateWeight)
// Map the name of each nested column to update to its UDF
val columnNameToUpdateToUDFMap = Map(
  "details.height" -> updateHeightUDF,
  "details.weight" -> updateWeightUDF
)
// Rebuild the struct once per column to update, starting from the original data
val updatedDF = allNestedColumnNamesToUpdate.foldLeft(data)((acc, columnNameToUpdate) => {
  val updateUDF = columnNameToUpdateToUDFMap(columnNameToUpdate)
  val updatedStructColumns = allNestedColumnNames.map(x => {
    // compare fully qualified names; alias with the bare field name so the struct schema is unchanged
    if (s"details.$x" == columnNameToUpdate) updateUDF(col(columnNameToUpdate)).as(x)
    else col(s"details.$x")
  })
  // use the accumulator so earlier updates are kept
  acc.withColumn("details", struct(updatedStructColumns: _*))
})
updatedDF.show()
/*
+--------------+---+------------------+
| name|age| details|
+--------------+---+------------------+
| Darth Vader| 80| {170, 170, male}|
|Luke Skywalker| 25| {170, 170, male}|
|Obi-Wan Kenobe| 50| {190, 190, male}|
| Princess Leia| 23|{190, 190, female}|
+--------------+---+------------------+
*/
Note: UDFs are not recommended, because they are opaque to Spark's optimizer.
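As a possible way to avoid the UDFs, the same rule can be written with built-in column functions. The following is only a minimal sketch, assuming Spark 3.1 or later (where Column.withField is available) and a local SparkSession; the helper name flipAround180 and the two-row sample are made up for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, struct, when}

// Assumption: a local session just for this sketch
val spark = SparkSession.builder().master("local[*]").appName("withFieldSketch").getOrCreate()
import spark.implicits._

// Build a small DataFrame with a nested "details" struct (illustrative data)
val people = Seq(
  ("Darth Vader", 80, 180, 200, "male"),
  ("Princess Leia", 23, 165, 150, "female")
).toDF("name", "age", "height", "weight", "sex")
  .select(col("name"), col("age"), struct(col("height"), col("weight"), col("sex")).as("details"))

// Hypothetical helper: same rule as the UDFs (below 180 becomes 190, otherwise 170),
// expressed with when/otherwise so the optimizer can see it
def flipAround180(c: Column): Column = when(c < 180, 190).otherwise(170)

// withField (Spark 3.1+) replaces a single struct field and keeps the others in place
val updatedNoUdf = people.withColumn(
  "details",
  col("details")
    .withField("height", flipAround180(col("details.height")))
    .withField("weight", flipAround180(col("details.weight")))
)

updatedNoUdf.show(false)
// Darth Vader -> {170, 170, male}; Princess Leia -> {190, 190, female}
```

One difference to keep in mind: for a null input, when/otherwise falls through to the otherwise branch (170 here), whereas a Scala UDF with a primitive Int parameter returns null, so the two versions only agree on non-null values.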