I am updating a Delta table with some incremental records. Two of the fields just need a plain update, but a third field is a collection of maps, and for that one I would like to concatenate the incoming values with the existing ones instead of replacing them.
import io.delta.tables.DeltaTable

val historicalDF = Seq(
  (1, 0, "Roger", Seq(Map("score" -> 5, "year" -> 2012)))
).toDF("id", "ts", "user", "scores")

historicalDF.write
  .format("delta")
  .mode("overwrite")
  .save(table_path)

val hist_dt: DeltaTable = DeltaTable.forPath(spark, table_path)

val incrementalDF = Seq(
  (1, 1, "Roger Rabbit", Seq(Map("score" -> 7, "year" -> 2013)))
).toDF("id", "ts", "user", "scores")
What I would like to have after the merge is something like this:
+---+---+------------+--------------------------------------------------------+
|id |ts |user |scores |
+---+---+------------+--------------------------------------------------------+
|1 |1 |Roger Rabbit|[{score -> 7, year -> 2013}, {score -> 5, year -> 2012}]|
+---+---+------------+--------------------------------------------------------+
What I tried to perform this concatenation is:
hist_dt
  .as("ex")
  .merge(incrementalDF.as("in"),
    "ex.id = in.id")
  .whenMatched
  .updateExpr(
    Map(
      "ts" -> "in.ts",
      "user" -> "in.user",
      "scores" -> "in.scores" ++ "ex.scores"
    )
  )
  .whenNotMatched
  .insertAll()
  .execute()
But the columns "in.scores" and "ex.scores" are interpreted as String, so I am getting the following error:
error: value ++ is not a member of (String, String)
Is there a way to add some more complex logic to updateExpr?
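For context on the error: `->` and `++` have the same precedence in Scala and associate to the left, so the line parses as `("scores" -> "in.scores") ++ "ex.scores"`, which is why the compiler reports `++` on a `(String, String)`. Since updateExpr takes SQL expression strings as values, one possible direction is to put the concatenation inside the string. The following is only a sketch: it assumes Spark 2.4 or later, where the SQL `concat` function accepts array arguments, and `UpdateExprSketch` is a hypothetical name used for illustration. It has not been run against a Delta table here.

```scala
object UpdateExprSketch {
  // Each value is a SQL expression string, so the array concatenation
  // is written inside the string rather than with Scala's ++ operator.
  val updates: Map[String, String] = Map(
    "ts"     -> "in.ts",
    "user"   -> "in.user",
    "scores" -> "concat(in.scores, ex.scores)"
  )

  // Intended use (needs a Delta table, so it is not executed in this sketch):
  //   hist_dt.as("ex")
  //     .merge(incrementalDF.as("in"), "ex.id = in.id")
  //     .whenMatched.updateExpr(UpdateExprSketch.updates)
  //     .whenNotMatched.insertAll()
  //     .execute()
}
```

Note that SQL `concat` returns null when either input is null and does not remove duplicates, so this only fits if neither side can be null and repeated entries are acceptable.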
Using update() instead of updateExpr() lets me pass the required columns to a UDF, where I can put more complex logic:
import org.apache.spark.sql.functions.{col, udf}

// Concatenate both lists of maps, tolerating null on either side
// and dropping exact duplicates.
def join_seq_map(incremental: Seq[Map[String,Integer]], existing: Seq[Map[String,Integer]]) : Seq[Map[String,Integer]] = {
  (incremental, existing) match {
    case (null, null) => null
    case (null, e)    => e
    case (i, null)    => i
    case (i, e)       => (i ++ e).distinct
  }
}

val join_seq_map_udf = udf(join_seq_map _)

hist_dt
  .as("ex")
  .merge(
    incrementalDF.as("in"),
    "ex.id = in.id")
  // Only update when the incoming record is newer than the stored one.
  .whenMatched("ex.ts < in.ts")
  .update(Map(
    "ts" -> col("in.ts"),
    "user" -> col("in.user"),
    "scores" -> join_seq_map_udf(col("in.scores"), col("ex.scores"))
  ))
  .whenNotMatched
  .insertAll()
  .execute()
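The merge logic inside the UDF can be checked on its own, without Spark or a Delta table. The sketch below repeats the same pattern match on plain Scala collections, using the sample rows from the question. The names `JoinSeqMapDemo` and `joinSeqMap` are hypothetical, and it uses `Int` instead of `Integer` for brevity, which is an assumption and not what the UDF above declares.

```scala
object JoinSeqMapDemo {
  type Scores = Seq[Map[String, Int]]

  // Same null handling and de-duplication as the UDF body above.
  def joinSeqMap(incremental: Scores, existing: Scores): Scores =
    (incremental, existing) match {
      case (null, null) => null
      case (null, e)    => e
      case (i, null)    => i
      case (i, e)       => (i ++ e).distinct
    }

  def main(args: Array[String]): Unit = {
    val existing: Scores    = Seq(Map("score" -> 5, "year" -> 2012))
    val incremental: Scores = Seq(Map("score" -> 7, "year" -> 2013))

    // Incoming entries come first, followed by the existing ones.
    println(joinSeqMap(incremental, existing))
    // Merging identical lists keeps a single copy of each map.
    println(joinSeqMap(incremental, incremental))
    // A null side falls back to the other side unchanged.
    println(joinSeqMap(null, existing))
  }
}
```

Because `distinct` relies on structural equality of the maps, re-running the merge with the same incremental record does not keep appending the same entry.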