How do I convert a flat DataFrame into nested JSON in Spark (Scala or Java)?

I have a SQL query that returns a data set like the following, available as a DataFrame:

id,type,name,ppu,batter.id,batter.type,topping.id,topping.type
101,donut,cake,0_55,1001,Regular,5001,None
101,donut,cake,0_55,1002,Chocolate,5001,None
101,donut,cake,0_55,1003,Blueberry,5001,None
101,donut,cake,0_55,1004,Devil's Food,5001,None
101,donut,cake,0_55,1001,Regular,5002,Glazed
101,donut,cake,0_55,1002,Chocolate,5002,Glazed
101,donut,cake,0_55,1003,Blueberry,5002,Glazed
101,donut,cake,0_55,1004,Devil's Food,5002,Glazed
101,donut,cake,0_55,1001,Regular,5003,Chocolate
101,donut,cake,0_55,1002,Chocolate,5003,Chocolate
101,donut,cake,0_55,1003,Blueberry,5003,Chocolate
101,donut,cake,0_55,1004,Devil's Food,5003,Chocolate

I need to convert this into a nested JSON structure like this:

{
    "id": "101",
    "type": "donut",
    "name": "Cake",
    "ppu": 0.55,
    "batter":
        [
            { "id": "1001", "type": "Regular" },
            { "id": "1002", "type": "Chocolate" },
            { "id": "1003", "type": "Blueberry" },
            { "id": "1004", "type": "Devil's Food" }
        ],
    "topping":
        [
            { "id": "5001", "type": "None" },
            { "id": "5002", "type": "Glazed" },
            { "id": "5003", "type": "Chocolate" }
        ]
}

Can this be done with a DataFrame aggregation, or do I have to write a custom transformation?

I found a similar question, "Writing nested JSON in spark scala", but it doesn't have a quite right answer.

So, apparently there is no straightforward way to do this via the DataFrame API. You could use

df.toJSON.{..}

but it won't give you the output you want.

You'll have to write a messy transform; I'd love to hear any other possible solutions. I'm assuming that your result fits in memory, as it must be brought back to the driver. Also, I'm using the Gson API to create the JSON here.

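import com.google.gson.{JsonArray, JsonObject}
import org.apache.spark.sql.Row

// Builds one nested JSON object from rows already collected to the driver.
// Assumes all rows share the same (id, type, name, ppu); if several parent
// records are present, each group overwrites the previous one in `jo`.
def arrToJson(arr: Array[Row]): JsonObject = {
  val jo = new JsonObject

  // Turns (id, type) pairs into a JsonArray of {"id": ..., "type": ...} objects.
  def toJsonArray(items: Seq[(String, String)]): JsonArray = {
    val ja = new JsonArray
    items.foreach { case (id, tpe) =>
      val o = new JsonObject
      o.addProperty("id", id)
      o.addProperty("type", tpe)
      ja.add(o)
    }
    ja
  }

  arr
    .map(row => row.toSeq.map(v => String.valueOf(v))) // every column as a string
    .groupBy(_.take(4))                                // key: id, type, name, ppu
    .foreach { case (parent, rows) =>
      jo.addProperty("id", parent(0))
      jo.addProperty("type", parent(1))
      jo.addProperty("name", parent(2))
      jo.addProperty("ppu", parent(3)) // stays a string, as in the flat data

      // The flat rows are a cross product of batters and toppings,
      // so keep each (id, type) pair only once.
      jo.add("batter", toJsonArray(rows.map(r => (r(4), r(5))).distinct.toSeq))
      jo.add("topping", toJsonArray(rows.map(r => (r(6), r(7))).distinct.toSeq))
    }

  jo
}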
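
The grouping step can also be tried without Spark or Gson on the classpath. The following is a minimal sketch in plain Scala, assuming the rows have already been collected to the driver. `FlatRow`, `Item`, `Donut`, and `NestSketch` are hypothetical names introduced only for this illustration, and the JSON rendering is deliberately naive (it escapes only backslashes and double quotes). It groups by the four parent columns and keeps each batter and topping once, since the flat rows are a cross product of the two.

```scala
// Hypothetical model of one flat row of the query result.
final case class FlatRow(id: String, tpe: String, name: String, ppu: String,
                         batterId: String, batterType: String,
                         toppingId: String, toppingType: String)

final case class Item(id: String, tpe: String)
final case class Donut(id: String, tpe: String, name: String, ppu: String,
                       batter: Seq[Item], topping: Seq[Item])

object NestSketch {
  // Group by the parent columns, then de-duplicate the child pairs.
  def nest(rows: Seq[FlatRow]): Seq[Donut] =
    rows
      .groupBy(r => (r.id, r.tpe, r.name, r.ppu))
      .toSeq
      .map { case ((id, tpe, name, ppu), group) =>
        Donut(id, tpe, name, ppu,
          group.map(r => Item(r.batterId, r.batterType)).distinct,
          group.map(r => Item(r.toppingId, r.toppingType)).distinct)
      }

  // Naive string quoting, for illustration only.
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Renders already-encoded values as a JSON object.
  private def obj(fields: (String, String)*): String =
    fields.map { case (k, v) => quote(k) + ":" + v }.mkString("{", ",", "}")

  private def arr(items: Seq[Item]): String =
    items.map(i => obj("id" -> quote(i.id), "type" -> quote(i.tpe)))
      .mkString("[", ",", "]")

  def toJson(d: Donut): String =
    obj("id" -> quote(d.id), "type" -> quote(d.tpe), "name" -> quote(d.name),
        "ppu" -> quote(d.ppu), "batter" -> arr(d.batter), "topping" -> arr(d.topping))

  def main(args: Array[String]): Unit = {
    // Same cross product as the sample data in the question.
    val rows = for {
      t <- Seq("5001" -> "None", "5002" -> "Glazed", "5003" -> "Chocolate")
      b <- Seq("1001" -> "Regular", "1002" -> "Chocolate",
               "1003" -> "Blueberry", "1004" -> "Devil's Food")
    } yield FlatRow("101", "donut", "cake", "0_55", b._1, b._2, t._1, t._2)

    nest(rows).map(toJson).foreach(println)
  }
}
```

In a real job, a JSON library such as Gson (as in the answer above) is the safer choice for rendering; the point here is only the group-then-deduplicate shape of the transform.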
