
Convert a Spark DataFrame's JSON column to an array of objects

I have a DataFrame with a JSON column. Each JSON value is basically a set of key-value pairs, as in the example below.

Col1
=====================================================================
|{"Name":"Ram","Place":"RamGarh"}                                    |
|{"Name":"Lakshman","Place":"LakshManPur","DepartMent":"Operations"} |
|{"Name":"Sita","Place":"SitaPur","Experience":"14"}                 |

I need to parse this JSON data. What would be the most efficient way?

I need to present it in the form of

case class dfCol(col:String, valu:String)

So basically I need to parse the JSON in every row of that DataFrame and convert it into the form

 |   Col
 |   ==========================================================
 |   Array(dfCol(Name,Ram),dfCol(Place,RamGarh))
 |   Array(dfCol(Name,Lakshman),dfCol(Place,LakshManPur),dfCol(DepartMent,Operations))
 |   Array(dfCol(Name,Sita),dfCol(Place,SitaPur),dfCol(Experience,14))

Use this:

case class dfCol(col:String, valu:String)

Load the test data provided

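import org.apache.spark.sql.Dataset
import spark.implicits._ // needed for .toDS() and for the Dataset encoders

val data =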
      """
        |{"Name":"Ram","Place":"RamGarh"}
        |{"Name":"Lakshman","Place":"LakshManPur","DepartMent":"Operations"}
        |{"Name":"Sita","Place":"SitaPur","Experience":14.0}
      """.stripMargin
    val df = spark.read.json(data.split(System.lineSeparator()).toSeq.toDS())
    df.show(false)
    df.printSchema()
    /**
      * +----------+----------+--------+-----------+
      * |DepartMent|Experience|Name    |Place      |
      * +----------+----------+--------+-----------+
      * |null      |null      |Ram     |RamGarh    |
      * |Operations|null      |Lakshman|LakshManPur|
      * |null      |14.0      |Sita    |SitaPur    |
      * +----------+----------+--------+-----------+
      *
      * root
      * |-- DepartMent: string (nullable = true)
      * |-- Experience: double (nullable = true)
      * |-- Name: string (nullable = true)
      * |-- Place: string (nullable = true)
      */

Convert Row -> Array[dfCol]

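   val ds: Dataset[Array[dfCol]] = df.map(row => {
      // Read the values as Any: Experience is inferred as double, not string
      row.getValuesMap[Any](row.schema.map(_.name))
        .filter(_._2 != null) // drop keys that are absent from this row
        .map{f => dfCol(f._1, String.valueOf(f._2))}
        .toArray
    })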
    ds.show(false)
    ds.printSchema()

    // +------------------------------------------------------------------+
    //|value                                                             |
    //+------------------------------------------------------------------+
    //|[[Name, Ram], [Place, RamGarh]]                                   |
    //|[[DepartMent, Operations], [Name, Lakshman], [Place, LakshManPur]]|
    //|[[Experience, 14.0], [Name, Sita], [Place, SitaPur]]              |
    //+------------------------------------------------------------------+
    //
    //root
    // |-- value: array (nullable = true)
    // |    |-- element: struct (containsNull = true)
    // |    |    |-- col: string (nullable = true)
    // |    |    |-- valu: string (nullable = true)
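
If the JSON already sits in a string column of an existing DataFrame (as in the question) rather than in a local string, the same row mapping can be applied after handing that column to `spark.read.json`. This is a minimal sketch, assuming a local SparkSession and a column named `Col1`; the sample rows and the names `raw`, `parsed`, and `result` are illustrative only.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class dfCol(col: String, valu: String)

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("json-col-to-array").getOrCreate()
import spark.implicits._

// Hypothetical input: one JSON document per row in a string column named Col1.
val raw = Seq(
  """{"Name":"Ram","Place":"RamGarh"}""",
  """{"Name":"Lakshman","Place":"LakshManPur","DepartMent":"Operations"}""",
  """{"Name":"Sita","Place":"SitaPur","Experience":"14"}"""
).toDF("Col1")

// Let Spark infer one column per JSON key across all rows.
val parsed = spark.read.json(raw.select($"Col1").as[String])

// Keep only the keys that are present (non-null) in each row.
val result: Dataset[Array[dfCol]] = parsed.map { row =>
  row.getValuesMap[Any](row.schema.map(_.name))
    .collect { case (k, v) if v != null => dfCol(k, String.valueOf(v)) }
    .toArray
}
result.show(false)
```

Note that `spark.read.json` makes an extra pass over the data to infer the schema unless a schema is supplied up front.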

Check the code below.

scala> import org.apache.spark.sql.types._

scala> val schema = MapType(StringType, StringType)

scala> df.show(false)
+-------------------------------------------------------------------+
|col1                                                               |
+-------------------------------------------------------------------+
|{"Name":"Ram","Place":"RamGarh"}                                   |
|{"Name":"Lakshman","Place":"LakshManPur","DepartMent":"Operations"}|
|{"Name":"Sita","Place":"SitaPur","Experience":"14"}                |
+-------------------------------------------------------------------+


scala> 

df
.withColumn("id",monotonically_increasing_id)
.select(from_json($"col1",schema).as("col1"),$"id")
.select(explode($"col1"),$"id")
.groupBy($"id")
.agg(collect_list(struct($"key",$"value")).as("col1"))
.select("col1")
.show(false)

+------------------------------------------------------------------+
|col1                                                              |
+------------------------------------------------------------------+
|[[Name, Ram], [Place, RamGarh]]                                   |
|[[Name, Lakshman], [Place, LakshManPur], [DepartMent, Operations]]|
|[[Name, Sita], [Place, SitaPur], [Experience, 14]]                |
+------------------------------------------------------------------+
scala> df.withColumn("id",monotonically_increasing_id).select(from_json($"col1",schema).as("col1"),$"id").select(explode($"col1"),$"id").groupBy($"id").agg(collect_list(struct($"key",$"value")).as("col1")).select("col1").printSchema
root
 |-- col1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = false)
 |    |    |-- value: string (nullable = true)
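
On Spark 3.0 or later, the `explode` / `groupBy` / `collect_list` round trip can likely be avoided: `map_entries` turns the parsed map into an array of `key`/`value` structs row by row, without a shuffle. This is a minimal sketch under those assumptions; the local session, the sample rows, and the name `entries` are illustrative and not part of the answer above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, map_entries}
import org.apache.spark.sql.types.{MapType, StringType}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("json-map-entries").getOrCreate()
import spark.implicits._

// Hypothetical input mirroring the col1 column shown above.
val df = Seq(
  """{"Name":"Ram","Place":"RamGarh"}""",
  """{"Name":"Lakshman","Place":"LakshManPur","DepartMent":"Operations"}""",
  """{"Name":"Sita","Place":"SitaPur","Experience":"14"}"""
).toDF("col1")

// Parse every JSON object as a string-to-string map.
val schema = MapType(StringType, StringType)

// map_entries converts map<string,string> into array<struct<key,value>>.
val entries = df.select(map_entries(from_json($"col1", schema)).as("col1"))
entries.show(false)
entries.printSchema()
```

Because nothing is regrouped, no `monotonically_increasing_id` column is needed, and the result does not depend on the element order of `collect_list`, which is not guaranteed after a shuffle.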

