Generic way to Parse Spark DataFrame to JSON Object/Array Using Spray JSON

I'm trying to find a generic way to convert a Spark DataFrame to a JSON object/array using Spray JSON or any other library.

I have tried to approach this using spray-json, and my current code looks like this:

import spray.json._
import spray.json.DefaultJsonProtocol._

val list = sc.parallelize(List(("a1","b1","c1","d1"),("a2","b2","c2","d2"))).toDF

list.show
+---+---+---+---+                                                               
| _1| _2| _3| _4|
+---+---+---+---+
| a1| b1| c1| d1|
| a2| b2| c2| d2|
+---+---+---+---+

val json = list.toJSON.collect.toJson.prettyPrint

println(json)

Current Output:

["{\"_1\":\"a1\",\"_2\":\"b1\",\"_3\":\"c1\",\"_4\":\"d1\"}", "{\"_1\":\"a2\",\"_2\":\"b2\",\"_3\":\"c2\",\"_4\":\"d2\"}"]

Expected Output:

[{
    "_1": "a1",
    "_2": "b1",
    "_3": "c1",
    "_4": "d1"
}, {
    "_1": "a2",
    "_2": "b2",
    "_3": "c2",
    "_4": "d2"
}]

Kindly suggest how to get the expected output in the required format without using a concrete Scala case class, either with spray-json or with any other library.

I based this on an earlier post, which I think would already have answered your question.

You're halfway there. By adding custom formatting code, you should be able to get the output in the desired format.

import scala.util.parsing.json.JSON
import scala.util.parsing.json.JSONArray   
import scala.util.parsing.json.JSONFormat   
import scala.util.parsing.json.JSONObject   
import scala.util.parsing.json.JSONType

// Thanks to Senia for providing this in her solution
def format(t: Any, i: Int = 0): String = t match {
  case o: JSONObject =>
    o.obj.map{ case (k, v) =>
      "  "*(i+1) + JSONFormat.defaultFormatter(k) + ": " + format(v, i+1)
    }.mkString("{\n", ",\n", "\n" + "  "*i + "}")

  case a: JSONArray =>
    a.list.map{
      e => "  "*(i+1) + format(e, i+1)
    }.mkString("[\n", ",\n", "\n" + "  "*i + "]")

  case _ => JSONFormat defaultFormatter t
}

val list = sc.parallelize(List(("a1","b1","c1","d1"),("a2","b2","c2","d2"))).toDF

// Create array
val jsonArray = list.toJSON.collect()

val jsonFormattedArray = jsonArray.map(j => format(JSON.parseRaw(j).get))

res1: Array[String] =
Array({
  "_1": "a1",
  "_2": "b1",
  "_3": "c1",
  "_4": "d1"
}, {
  "_1": "a2",
  "_2": "b2",
  "_3": "c2",
  "_4": "d2"
})

Join the formatted JSON objects into a single string (note that the result has no enclosing "[" and "]"):

scala> jsonFormattedArray.toList.mkString(",")

res2: String =
{
  "_1": "a1",
  "_2": "b1",
  "_3": "c1",
  "_4": "d1"
},{
  "_1": "a2",
  "_2": "b2",
  "_3": "c2",
  "_4": "d2"
}
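
The question's output is escaped because calling toJson on an Array[String] serializes every element as a JSON string, not as a JSON object. As a rough sketch of an alternative that stays within spray-json (not part of the answer above), each row string can first be parsed with parseJson and the results wrapped in a JsArray. The hard-coded sample strings below are an assumption standing in for the result of list.toJSON.collect, and spray-json must be on the classpath.

```scala
import spray.json._

// Assumption: stand-in for `list.toJSON.collect`, which yields one JSON document per row.
val rowStrings: Array[String] = Array(
  """{"_1":"a1","_2":"b1","_3":"c1","_4":"d1"}""",
  """{"_1":"a2","_2":"b2","_3":"c2","_4":"d2"}"""
)

// Parse each row into a JsValue so it is treated as an object, not as a string.
val parsedRows: Vector[JsValue] = rowStrings.map(_.parseJson).toVector

// Wrap the parsed objects in a JSON array and pretty-print it.
val jsonArrayValue: JsArray = JsArray(parsedRows)
println(jsonArrayValue.prettyPrint)
```

With a real DataFrame, the same idea would presumably read JsArray(list.toJSON.collect.map(_.parseJson).toVector).prettyPrint. No case class is needed because the rows are handled as generic JsValue trees.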

After trying various approaches with various libraries, I finally settled on the simple approach below.

val list = sc.parallelize(List(("a1","b1","c1","d1"),("a2","b2","c2","d2"))).toDF

val jsonArray = list.toJSON.collect
/*jsonArray: Array[String] = Array({"_1":"a1","_2":"b1","_3":"c1","_4":"d1"}, {"_1":"a2","_2":"b2","_3":"c2","_4":"d2"})*/

val finalOutput = jsonArray.mkString("[", ",", "]")

/*finalOutput: String = [{"_1":"a1","_2":"b1","_3":"c1","_4":"d1"},{"_1":"a2","_2":"b2","_3":"c2","_4":"d2"}]*/

This approach does not need spray-json or any other library.
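
The mkString step works because every element is already a complete JSON document, so joining them with commas inside square brackets yields a JSON array. A minimal sketch of that step in isolation, with no Spark dependency, might look like this; the helper name toJsonArray and the shortened sample rows are assumptions for illustration only.

```scala
// Hypothetical helper: joins already-serialized JSON documents into one JSON array string.
// It does not validate its input, so each element must itself be valid JSON.
def toJsonArray(rows: Iterable[String]): String =
  rows.mkString("[", ",", "]")

// Assumption: shortened stand-ins for the strings produced by `list.toJSON.collect`.
val sampleRows = Seq("""{"_1":"a1"}""", """{"_1":"a2"}""")

val joined = toJsonArray(sampleRows)
val empty = toJsonArray(Seq.empty[String])

println(joined)
println(empty)
```

Element order follows the order of the input collection, and an empty input still produces a well-formed empty array.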

Special thanks to @Aman Sehgal. His answer helped me to come up with this optimal solution.

Note: I have yet to analyze the performance of this approach on a large DataFrame, but in some basic performance testing it looks as fast as the ".toJson.prettyPrint" of "spray-json".

