Spark Scala: add a JSON column from other columns conditionally

With a dataframe such as:

+-----+-----+-----+-----+-----+-----+
|old_a|new_a|    a|old_b|new_b|    b|
+-----+-----+-----+-----+-----+-----+
|    6|    7| true|    6|    6|false|
|    1|    1|false|   12|    8| true|
|    1|    2| true|    2|    8| true|
|    1| null| true|    2|    8| true|
+-----+-----+-----+-----+-----+-----+

Note: 'a' is true when 'new_a' differs from 'old_a', and likewise for 'b'.

I'd like to add a JSON column built from some of the other columns, following the rule "if 'a' is true, the value of 'new_a' must be added to the JSON, and the same for 'b'",

which would produce the following dataframe:

+-----+-----+-----+-----+-----+-----+-----------------------+
|old_a|new_a|    a|old_b|new_b|    b|json                   |
+-----+-----+-----+-----+-----+-----+-----------------------+
|    6|    7| true|    6|    6|false|{ "a" : 7 }            |
|    1|    1|false|   12|    8| true|{ "b" : 8 }            |
|    1|    2| true|    2|    8| true|{ "a" : 2, "b" : 8 }   |
|    1| null| true|    2|    8| true|{ "a" : null, "b" : 8 }|
+-----+-----+-----+-----+-----+-----+-----------------------+

Is there a way to achieve that without UDFs?

If not, what would be the best way to write the UDF so that it is not too costly?

Thanks

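Use the to_json and struct functions.

By default, to_json drops every field whose value is null, which is why I converted the new_a column to a string in the second example. Note that its values are then written as JSON strings ("7", "null") rather than as a number or a JSON null. The keys are also new_a and new_b; alias the columns inside struct (for example $"new_a".as("a")) to get the keys a and b.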

With new_a as an integer column:

scala> df.show(false)
+-----+-----+-----+-----+-----+-----+
|old_a|new_a|a    |old_b|new_b|b    |
+-----+-----+-----+-----+-----+-----+
|6    |7    |true |6    |6    |false|
|1    |1    |false|12   |8    |true |
|1    |2    |true |2    |8    |true |
|1    |null |true |2    |8    |true |
+-----+-----+-----+-----+-----+-----+


scala> df.printSchema
root
 |-- old_a: integer (nullable = false)
 |-- new_a: integer (nullable = true)
 |-- a: boolean (nullable = false)
 |-- old_b: integer (nullable = false)
 |-- new_b: integer (nullable = false)
 |-- b: boolean (nullable = false)


scala> df.withColumn("json",when($"a" && $"b",to_json(struct($"new_a",$"new_b"))).when($"a",to_json(struct($"new_a"))).otherwise(to_json(struct($"new_b")))).show(false)
+-----+-----+-----+-----+-----+-----+---------------------+
|old_a|new_a|a    |old_b|new_b|b    |json                 |
+-----+-----+-----+-----+-----+-----+---------------------+
|6    |7    |true |6    |6    |false|{"new_a":7}          |
|1    |1    |false|12   |8    |true |{"new_b":8}          |
|1    |2    |true |2    |8    |true |{"new_a":2,"new_b":8}|
|1    |null |true |2    |8    |true |{"new_b":8}          |
+-----+-----+-----+-----+-----+-----+---------------------+

With new_a as a string column:

scala> df.show(false)
+-----+-----+-----+-----+-----+-----+
|old_a|new_a|a    |old_b|new_b|b    |
+-----+-----+-----+-----+-----+-----+
|6    |7    |true |6    |6    |false|
|1    |1    |false|12   |8    |true |
|1    |2    |true |2    |8    |true |
|1    |null |true |2    |8    |true |
+-----+-----+-----+-----+-----+-----+


scala> df.printSchema
root
 |-- old_a: integer (nullable = false)
 |-- new_a: string (nullable = true)
 |-- a: boolean (nullable = false)
 |-- old_b: integer (nullable = false)
 |-- new_b: integer (nullable = false)
 |-- b: boolean (nullable = false)


scala> df.withColumn("json",when($"a" && $"b",to_json(struct($"new_a",$"new_b"))).when($"a",to_json(struct($"new_a"))).otherwise(to_json(struct($"new_b")))).show(false)
+-----+-----+-----+-----+-----+-----+--------------------------+
|old_a|new_a|a    |old_b|new_b|b    |json                      |
+-----+-----+-----+-----+-----+-----+--------------------------+
|6    |7    |true |6    |6    |false|{"new_a":"7"}             |
|1    |1    |false|12   |8    |true |{"new_b":8}               |
|1    |2    |true |2    |8    |true |{"new_a":"2","new_b":8}   |
|1    |null |true |2    |8    |true |{"new_a":"null","new_b":8}|
+-----+-----+-----+-----+-----+-----+--------------------------+
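
To get the keys a and b and an explicit JSON null, as in the expected output of the question, the same when/to_json chain can be combined with column aliases and the ignoreNullFields option. This is only a minimal sketch, assuming Spark 3.0 or later (where to_json honours that option). The object JsonFromChangedColumns and the helpers addJson and sample are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, struct, to_json, when}

object JsonFromChangedColumns {
  // Assumption: Spark 3.0+. This option makes to_json write null fields instead of dropping them.
  private val keepNulls = Map("ignoreNullFields" -> "false")

  // Hypothetical helper: adds a "json" column holding new_a / new_b under the keys a / b,
  // each one included only when its boolean flag column is true.
  def addJson(df: DataFrame): DataFrame = {
    val aValue = col("new_a").as("a")
    val bValue = col("new_b").as("b")
    df.withColumn("json",
      when(col("a") && col("b"), to_json(struct(aValue, bValue), keepNulls))
        .when(col("a"), to_json(struct(aValue), keepNulls))
        .when(col("b"), to_json(struct(bValue), keepNulls))
        .otherwise(lit("{}"))) // neither value changed
  }

  // Sample rows from the question; Option[Int] yields a nullable integer column.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (6, Option(7), true, 6, 6, false),
      (1, Option(1), false, 12, 8, true),
      (1, Option(2), true, 2, 8, true),
      (1, Option.empty[Int], true, 2, 8, true)
    ).toDF("old_a", "new_a", "a", "old_b", "new_b", "b")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("json-column-sketch").getOrCreate()
    addJson(sample(spark)).show(false)
    spark.stop()
  }
}
```

With the option set, the last sample row should come out as {"a":null,"b":8} while new_a stays an integer column. The explicit third when branch also avoids emitting new_b for a row where neither flag is true.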

A generalization of Srinivas's solution, for when the number of old/new column pairs is not known in advance.

(Note: something I didn't mention is that columns 'a' and 'b' were only there to tell whether the value changed between old_a and new_a, and between old_b and new_b respectively.)

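    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{col, struct, to_json, when}
    import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

    val df = Seq(
      (null, "a", "b", "b"),
      ("a", null, "b", "b"),
      ("a", "a2", "b", "b"),
      ("a", "a2", "b", "b2"),
      (null, null, "b", "b2")
    ).toDF("old_a", "new_a", "old_b", "new_b")

    // Replace nulls with empty strings so they are not confused with the
    // null that is deliberately set below to mark an unchanged value.
    val df2 = df.na.fill("", df.columns)

    df2.show()

    // Base names ("a", "b"), obtained by stripping the old_/new_ prefixes.
    val colNames = df2.columns.map(name => name.stripPrefix("old_").stripPrefix("new_")).distinct
    // For each base name, add a column holding the new value if it changed, otherwise null.
    val res = colNames.foldLeft(df2) { (tempDF, colName) =>
      tempDF.withColumn(colName,
        when(col(s"old_$colName").equalTo(col(s"new_$colName")), null)
          .otherwise(col(s"new_$colName"))
      )
    }
    // to_json drops null fields, so only the changed values end up in the JSON.
    val cols: Array[Column] = colNames.map(col(_))
    val resWithJson = res.withColumn("json", to_json(struct(cols: _*)))

    resWithJson.show(false)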

Output:

+-----+-----+-----+-----+
|old_a|new_a|old_b|new_b|
+-----+-----+-----+-----+
|     |    a|    b|    b|
|    a|     |    b|    b|
|    a|   a2|    b|    b|
|    a|   a2|    b|   b2|
|     |     |    b|   b2|
+-----+-----+-----+-----+

+-----+-----+-----+-----+----+----+-------------------+
|old_a|new_a|old_b|new_b|a   |b   |json               |
+-----+-----+-----+-----+----+----+-------------------+
|     |a    |b    |b    |a   |null|{"a":"a"}          |
|a    |     |b    |b    |    |null|{"a":""}           |
|a    |a2   |b    |b    |a2  |null|{"a":"a2"}         |
|a    |a2   |b    |b2   |a2  |b2  |{"a":"a2","b":"b2"}|
|     |     |b    |b2   |null|b2  |{"b":"b2"}         |
+-----+-----+-----+-----+----+----+-------------------+
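
The prefix-stripping step above treats every column as part of an old/new pair. If the DataFrame also carries unrelated columns (such as the a and b flags from the question, or an id), a stricter derivation may be safer. The following is a plain-Scala sketch of one way to do it; the object ColumnPairs and the helper pairedBaseNames are hypothetical names, not part of Spark.

```scala
object ColumnPairs {
  // Hypothetical helper: returns the base names for which both an old_<name>
  // and a new_<name> column exist, in the order the old_ columns appear.
  def pairedBaseNames(columns: Seq[String]): Seq[String] = {
    val present = columns.toSet
    columns
      .filter(_.startsWith("old_"))                    // candidates come from old_ columns only
      .map(_.stripPrefix("old_"))
      .filter(base => present.contains(s"new_$base"))  // keep only complete pairs
      .distinct
  }
}
```

It could stand in for the colNames line, for example `val colNames = ColumnPairs.pairedBaseNames(df2.columns.toSeq)`, in which case `cols` becomes a Seq[Column] rather than an Array[Column].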
