简体   繁体   中英

Spark dataframe from Json string with nested key

I have several columns to be extracted from json string. However one field has nested values. Not sure how to deal with that?

Need to explode into multiple rows to get values of field name, Value1, Value2.

import spark.implicits._

val df = Seq(
  ("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
  ("2", """{"p": "bar", "q": 3.0}""", "some_other_field_2"),
  ("3",
    """{"nestedKey":[ {"field name":"name1","Value1":false,"Value2":true},
      |                 {"field name":"name2","Value1":"100","Value2":"200"}
      |]}""".stripMargin, "some_other_field_3")

).toDF("id","json","other")

df.show(truncate = false)
val df1= df.withColumn("id1",col("id"))
  .withColumn("other1",col("other"))
  .withColumn("k",get_json_object(col("json"),"$.k"))
  .withColumn("v",get_json_object(col("json"),"$.v"))
  .withColumn("p",get_json_object(col("json"),"$.p"))
  .withColumn("q",get_json_object(col("json"),"$.q"))
  .withColumn("nestedKey",get_json_object(col("json"),"$.nestedKey"))
    .select("id1","other1","k","v","p","q","nestedKey")
df1.show(truncate = false)

You can parse the nestedKey using from_json and explode it:

val df2 = df1.withColumn(
    "nestedKey", 
    expr("explode_outer(from_json(nestedKey, 'array<struct<`field name`:string, Value1:string, Value2:string>>'))")
).select("*", "nestedKey.*").drop("nestedKey")

df2.show
+---+------------------+----+----+----+----+----------+------+------+
|id1|            other1|   k|   v|   p|   q|field name|Value1|Value2|
+---+------------------+----+----+----+----+----------+------+------+
|  1|some_other_field_1| foo| 1.0|null|null|      null|  null|  null|
|  2|some_other_field_2|null|null| bar| 3.0|      null|  null|  null|
|  3|some_other_field_3|null|null|null|null|     name1| false|  true|
|  3|some_other_field_3|null|null|null|null|     name2|   100|   200|
+---+------------------+----+----+----+----+----------+------+------+

i did it in one dataframe

 val df1= df.withColumn("id1",col("id"))
    .withColumn("other1",col("other"))
    .withColumn("k",get_json_object(col("json"),"$.k"))
    .withColumn("v",get_json_object(col("json"),"$.v"))
    .withColumn("p",get_json_object(col("json"),"$.p"))
    .withColumn("q",get_json_object(col("json"),"$.q"))
    .withColumn("nestedKey",get_json_object(col("json"),"$.nestedKey"))
  .withColumn(
    "nestedKey",
    expr("explode_outer(from_json(nestedKey, 'array<struct<`field name`:string, Value1:string, Value2:string>>'))")
  ).withColumn("fieldname",col("nestedKey.field name"))
    .withColumn("valueone",col("nestedKey.Value1"))
    .withColumn("valuetwo",col("nestedKey.Value2"))
   .select("id1","other1","k","v","p","q","fieldname","valueone","valuetwo")```


still working to make it more elegant

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM