
Parse JSON objects in Spark Scala

I have a set of JSON objects that I want to read using Spark Scala. A sample of the file data is below; one file contains more than 100 objects.

//file1

{
  "id":"31342547689",
  "Name":"Jacob",
  "Sex":"M",
  "Destination":"Accounts"
}
{
  "id":"987875637898",
  "Name":"Martin",
  "Sex":"M",
  "Destination":"Sr.Accounts"
}
{
  "id":"64542457879",
  "Name":"lucifer",
  "Sex":"M",
  "Destination":"Developer"
}
{
  "id":"23824723354",
  "Name":"Ratin",
  "Sex":"M",
  "Destination":"Sr.Developer"
}

When I use the code below, only the first object is printed.

val dataframe = spark
      .read
      .option("multiLine", true)  // treats the whole file as one JSON document,
                                  // so only the first object is parsed
      .schema(Schema)
      .json("D:\\appdata\\file1")
dataframe.show()

You can read the pretty-printed JSON file using Spark's wholeTextFiles API:

import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.{DefaultScalaModule, ScalaObjectMapper}
import spark.implicits._

val input = spark.sparkContext.wholeTextFiles(inputFile).map(_._2)

val output = input.mapPartitions(records => {
    // ObjectMapper is not serializable, so create one mapper per partition
    // (it is instantiated on each executor node)
    val mapper = new ObjectMapper with ScalaObjectMapper
    mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
    mapper.registerModule(DefaultScalaModule)
    records.map(record => {
        try {
            mapper.readValue(record, classOf[DTOclass])
        } catch {
            case e: Exception => null
        }
    })
}).filter(_ != null).toDF.as[DTOclass]

output.write.json(oplocation)
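
DTOclass isn't shown in the answer; a minimal sketch is a plain case class whose fields mirror the sample records (case classes are Serializable by default, which satisfies the note below):

// hypothetical DTO matching the sample JSON keys
case class DTOclass(id: String, Name: String, Sex: String, Destination: String)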

You can also use the Gson library instead of ObjectMapper.
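
For example, here is a minimal sketch of the same mapPartitions loop using Gson, assuming the same input RDD and DTOclass as above:

import com.google.gson.Gson

val parsed = input.mapPartitions(records => {
    // Gson instances are not serializable either, so create one per partition
    val gson = new Gson()
    records.map(record => {
        try {
            gson.fromJson(record, classOf[DTOclass])
        } catch {
            case e: Exception => null
        }
    })
}).filter(_ != null)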

Note: DTOclass should be serializable.

Alternatively, you can clean up the pretty-printed JSON input file and read it with the spark.read.json API, as mentioned here.
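
A minimal sketch of that cleanup idea, with an illustrative path: split the file content between adjacent objects, collapse each object onto one line (JSON Lines), and let spark.read.json parse the result without multiLine:

import spark.implicits._

val jsonLines = spark.sparkContext.wholeTextFiles("D:/appdata/file1")
  .flatMap(_._2.split("(?<=\\})\\s*(?=\\{)"))  // split between "}" and the next "{"
  .map(_.replaceAll("\\s*\\n\\s*", ""))        // drop newlines inside each object

val df = spark.read.json(jsonLines.toDS)
df.show()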

It seems your file doesn't contain valid JSON. You can validate it at https://jsonlint.com/. Ideally, it should look like this:

[{
  "id":"31342547689",
  "Name":"Jacob",
  "Sex":"M",
  "Destination":"Accounts"
},
{
  "id":"987875637898",
  "Name":"Martin",
  "Sex":"M",
  "Destination":"Sr.Accounts"
},
{
  "id":"64542457879",
  "Name":"lucifer",
  "Sex":"M",
  "Destination":"Developer"
},
{
  "id":"23824723354",
  "Name":"Ratin",
  "Sex":"M",
  "Destination":"Sr.Developer"
}]
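
With the file in that shape, the original multiLine read works as intended (a sketch; the path is illustrative):

// once the file holds a single JSON array, multiLine mode parses all records
val df = spark.read
  .option("multiLine", true)
  .json("D:/appdata/file1_fixed.json")
df.show()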

So first we need to preprocess the file to turn it into valid JSON.

// read the whole file as a single (path, content) record
val df = sc.wholeTextFiles("/yourHDFSLocation/file1.json").toDF

// insert "},{" between adjacent objects, wrap the result in [ ... ],
// and write the corrected file out as plain text
df.select("_2")
  .map(s => "[" + s.mkString.replaceAll("\\}.*\n{0,}.*\\{", "},{") + "]")
  .repartition(1).write.mode("overwrite").text("/yourHDFSLocation/correctedJson/")

Then we can read the valid JSON:

val ok=spark.read.schema(simpleSchema).option("multiline", "true").json("/yourHDFSLocation/correctedJson/p*")  

Output:

ok.show(false)
+------------+-------+---+------------+
|id          |Name   |Sex|Destination |
+------------+-------+---+------------+
|31342547689 |Jacob  |M  |Accounts    |
|987875637898|Martin |M  |Sr.Accounts |
|64542457879 |lucifer|M  |Developer   |
|23824723354 |Ratin  |M  |Sr.Developer|
+------------+-------+---+------------+

Another solution, if you don't want to save an intermediate file:

val rdd = sc.wholeTextFiles("/yourHDFSLocation/file1.json")
val rdd2 = rdd.map(s => "[" + s._2.replaceAll("\\}.*\n{0,}.*\\{", "},{") + "]")
val ok = spark.read.schema(simpleSchema).option("multiline", "true").json(rdd2)
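
Note that the json(RDD[String]) overload has been deprecated since Spark 2.2; converting to a Dataset[String] first avoids the warning:

// same read via the non-deprecated Dataset[String] overload
import spark.implicits._
val ok2 = spark.read.schema(simpleSchema).json(rdd2.toDS)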

Hope this resolves the problem!
