With a dataframe such as:
+-----+-----+-----+-----+-----+-----+
|old_a|new_a|    a|old_b|new_b|    b|
+-----+-----+-----+-----+-----+-----+
|    6|    7| true|    6|    6|false|
|    1|    1|false|   12|    8| true|
|    1|    2| true|    2|    8| true|
|    1| null| true|    2|    8| true|
+-----+-----+-----+-----+-----+-----+
Note: 'a' is true when 'new_a' differs from 'old_a', and likewise for 'b'.
I'd like to add a JSON column built from the other columns, following this rule: if 'a' is true, the value of 'new_a' must be added to the JSON object, and the same for 'b'.
This would produce the following dataframe:
+-----+-----+-----+-----+-----+-----+-----------------------+
|old_a|new_a|    a|old_b|new_b|    b|json                   |
+-----+-----+-----+-----+-----+-----+-----------------------+
|    6|    7| true|    6|    6|false|{ "a" : 7 }            |
|    1|    1|false|   12|    8| true|{ "b" : 8 }            |
|    1|    2| true|    2|    8| true|{ "a" : 2, "b" : 8 }   |
|    1| null| true|    2|    8| true|{ "a" : null, "b" : 8 }|
+-----+-----+-----+-----+-----+-----+-----------------------+
Is there a way to achieve that without UDFs?
If not, what would be the best way to write the UDF so that it isn't too costly?
Thanks
Use the to_json and struct functions.
By default, to_json drops every field whose value is null. For this reason I converted the new_a column to string type, so that the missing value is carried as the literal string "null" (see the second run below).
With new_a as integer:
scala> df.show(false)
+-----+-----+-----+-----+-----+-----+
|old_a|new_a|a |old_b|new_b|b |
+-----+-----+-----+-----+-----+-----+
|6 |7 |true |6 |6 |false|
|1 |1 |false|12 |8 |true |
|1 |2 |true |2 |8 |true |
|1 |null |true |2 |8 |true |
+-----+-----+-----+-----+-----+-----+
scala> df.printSchema
root
|-- old_a: integer (nullable = false)
|-- new_a: integer (nullable = true)
|-- a: boolean (nullable = false)
|-- old_b: integer (nullable = false)
|-- new_b: integer (nullable = false)
|-- b: boolean (nullable = false)
scala> df.withColumn("json",when($"a" && $"b",to_json(struct($"new_a",$"new_b"))).when($"a",to_json(struct($"new_a"))).otherwise(to_json(struct($"new_b")))).show(false)
+-----+-----+-----+-----+-----+-----+---------------------+
|old_a|new_a|a |old_b|new_b|b |json |
+-----+-----+-----+-----+-----+-----+---------------------+
|6 |7 |true |6 |6 |false|{"new_a":7} |
|1 |1 |false|12 |8 |true |{"new_b":8} |
|1 |2 |true |2 |8 |true |{"new_a":2,"new_b":8}|
|1 |null |true |2 |8 |true |{"new_b":8} |
+-----+-----+-----+-----+-----+-----+---------------------+
With new_a as string:
scala> df.show(false)
+-----+-----+-----+-----+-----+-----+
|old_a|new_a|a |old_b|new_b|b |
+-----+-----+-----+-----+-----+-----+
|6 |7 |true |6 |6 |false|
|1 |1 |false|12 |8 |true |
|1 |2 |true |2 |8 |true |
|1 |null |true |2 |8 |true |
+-----+-----+-----+-----+-----+-----+
scala> df.printSchema
root
|-- old_a: integer (nullable = false)
|-- new_a: string (nullable = true)
|-- a: boolean (nullable = false)
|-- old_b: integer (nullable = false)
|-- new_b: integer (nullable = false)
|-- b: boolean (nullable = false)
scala> df.withColumn("json",when($"a" && $"b",to_json(struct($"new_a",$"new_b"))).when($"a",to_json(struct($"new_a"))).otherwise(to_json(struct($"new_b")))).show(false)
+-----+-----+-----+-----+-----+-----+--------------------------+
|old_a|new_a|a |old_b|new_b|b |json |
+-----+-----+-----+-----+-----+-----+--------------------------+
|6 |7 |true |6 |6 |false|{"new_a":"7"} |
|1 |1 |false|12 |8 |true |{"new_b":8} |
|1 |2 |true |2 |8 |true |{"new_a":"2","new_b":8} |
|1 |null |true |2 |8 |true |{"new_a":"null","new_b":8}|
+-----+-----+-----+-----+-----+-----+--------------------------+
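If you are on Spark 3.0 or later (an assumption on my side, the question does not state a version), the JSON option ignoreNullFields can be passed to to_json so that null fields are kept, which avoids the cast to string. The sketch below also aliases the columns to get the keys "a" and "b" asked for in the question, and adds an explicit branch for rows where neither flag is set (the chain above falls through to new_b in that case). The names keepNulls, aField, bField and withJson are only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("json-flags").getOrCreate()
import spark.implicits._

// Same data as in the question; Option[Int] models the nullable new_a column.
val input = Seq[(Int, Option[Int], Boolean, Int, Int, Boolean)](
  (6, Some(7), true, 6, 6, false),
  (1, Some(1), false, 12, 8, true),
  (1, Some(2), true, 2, 8, true),
  (1, None, true, 2, 8, true)
).toDF("old_a", "new_a", "a", "old_b", "new_b", "b")

// Assumption: Spark 3.0+, where the JSON writer honours ignoreNullFields.
val keepNulls = Map("ignoreNullFields" -> "false")

// Alias the value columns so the JSON keys are "a" and "b".
val aField = col("new_a").as("a")
val bField = col("new_b").as("b")

val withJson = input.withColumn("json",
  when(col("a") && col("b"), to_json(struct(aField, bField), keepNulls))
    .when(col("a"), to_json(struct(aField), keepNulls))
    .when(col("b"), to_json(struct(bField), keepNulls))
    .otherwise(lit("{}"))) // neither flag set: empty object

withJson.show(false)
```

With new_a kept as an integer, the last row should then come out as {"a":null,"b":8} instead of losing the "a" key. Note that the when chain still grows with the number of flag combinations, so this only stays readable for a couple of column pairs.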
Here is a generalization of Srinivas's solution for the case where the number of old/new column pairs is not known in advance.
(Note: something I didn't mention is that columns 'a' and 'b' were only there to tell whether the value changed between old_a and new_a, and between old_b and new_b respectively.)
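// Run in spark-shell (spark.implicits._ is already in scope for toDF).
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

val df = Seq[(String, String, String, String)](
  (null, "a", "b", "b"),
  ("a", null, "b", "b"),
  ("a", "a2", "b", "b"),
  ("a", "a2", "b", "b2"),
  (null, null, "b", "b2")
).toDF("old_a", "new_a", "old_b", "new_b")

// Replace nulls with empty strings so they cannot be confused with the
// nulls we set on purpose below to mark "unchanged" values.
val df2 = df.na.fill("", df.columns)
df2.show()

// Derive the base names ("a", "b") from the old_/new_ column names.
val colNames = df2.columns.map(name => name.stripPrefix("old_").stripPrefix("new_")).distinct

// For each pair, add a column holding the new value if it changed, else null.
val res = colNames.foldLeft(df2) { (tempDF, colName) =>
  tempDF.withColumn(colName,
    when(col(s"old_$colName").equalTo(col(s"new_$colName")), lit(null))
      .otherwise(col(s"new_$colName"))
  )
}

// to_json drops the null fields, so only the changed values end up in the JSON.
val cols: Array[Column] = colNames.map(col(_))
val resWithJson = res.withColumn("json", to_json(struct(cols: _*)))
resWithJson.show(false)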
output:
+-----+-----+-----+-----+
|old_a|new_a|old_b|new_b|
+-----+-----+-----+-----+
| | a| b| b|
| a| | b| b|
| a| a2| b| b|
| a| a2| b| b2|
| | | b| b2|
+-----+-----+-----+-----+
+-----+-----+-----+-----+----+----+-------------------+
|old_a|new_a|old_b|new_b|a |b |json |
+-----+-----+-----+-----+----+----+-------------------+
| |a |b |b |a |null|{"a":"a"} |
|a | |b |b | |null|{"a":""} |
|a |a2 |b |b |a2 |null|{"a":"a2"} |
|a |a2 |b |b2 |a2 |b2 |{"a":"a2","b":"b2"}|
| | |b |b2 |null|b2 |{"b":"b2"} |
+-----+-----+-----+-----+----+----+-------------------+