Parse JSON Object in spark scala

I have a set of JSON objects, and I want to read that data using Spark Scala. I'll put sample file data below; one file contains more than 100 objects.

//file1

{
  "id":"31342547689",
  "Name":"Jacob",
  "Sex":"M",
  "Destination":"Accounts"
}
{
  "id":"987875637898",
  "Name":"Martin",
  "Sex":"M",
  "Destination":"Sr.Accounts"
}
{
  "id":"64542457879",
  "Name":"lucifer",
  "Sex":"M",
  "Destination":"Developer"
}
{
  "id":"23824723354",
  "Name":"Ratin",
  "Sex":"M",
  "Destination":"Sr.Developer"
}

When I use the code below, I am only able to print the first object.

val dataframe = spark
      .read
      .option("multiLine", true)
      .schema(Schema)
      .json("D:\\appdata\\file1")
dataframe.show()

You can read the pretty-printed JSON file by using the Spark wholeTextFiles API:

import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.{DefaultScalaModule, ScalaObjectMapper}
import spark.implicits._

val input = spark.sparkContext.wholeTextFiles(inputFile).map(_._2)

val output = input.mapPartitions { records =>
    // ObjectMapper is not serializable, so create one mapper per partition
    // on each executor instead of shipping an instance from the driver
    val mapper = new ObjectMapper with ScalaObjectMapper
    mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
    mapper.registerModule(DefaultScalaModule)
    records.map { record =>
        try {
            mapper.readValue(record, classOf[DTOclass])
        } catch {
            case e: Exception => null
        }
    }
}.filter(_ != null).toDF.as[DTOclass]

output.write.json(oplocation)

You can also use the Gson library instead of ObjectMapper, as sketched below.
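A minimal sketch of the same per-partition pattern with Gson, assuming the gson artifact is on the classpath and the same hypothetical DTOclass as above:

import com.google.gson.Gson

val output = input.mapPartitions { records =>
    // Gson is not serializable either, so build one instance per partition
    val gson = new Gson()
    records.flatMap { record =>
        try Some(gson.fromJson(record, classOf[DTOclass]))
        catch { case _: Exception => None } // skip malformed records
    }
}.toDF.as[DTOclass]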

Note: DTOclass should be serializable.
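For reference, a hypothetical DTOclass matching the sample records could be a case class, which is serializable by default:

case class DTOclass(id: String, Name: String, Sex: String, Destination: String)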

Or you can clean up your pretty-printed input JSON file and read it with the spark.read.json API, as mentioned here.

It seems your file doesn't contain valid JSON. You can validate it at https://jsonlint.com/. Ideally, it should look like this:

[{
  "id":"31342547689",
  "Name":"Jacob",
  "Sex":"M",
  "Destination":"Accounts"
},
{
  "id":"987875637898",
  "Name":"Martin",
  "Sex":"M",
  "Destination":"Sr.Accounts"
},
{
  "id":"64542457879",
  "Name":"lucifer",
  "Sex":"M",
  "Destination":"Developer"
},
{
  "id":"23824723354",
  "Name":"Ratin",
  "Sex":"M",
  "Destination":"Sr.Developer"
}]

So, first we need to preprocess this file to turn it into valid JSON.

// Read each file as a single string, rewrite the "} ... {" boundaries
// between objects as "},{", and wrap the result in [ ] to form a JSON array
val df = sc.wholeTextFiles("/yourHDFSLocation/file1.json").toDF
df.select("_2")
  .map(s => "[" + s.mkString.replaceAll("\\}.*\n{0,}.*\\{", "},{") + "]")
  .repartition(1)
  .write.mode("overwrite")
  .text("/yourHDFSLocation/correctedJson/")

Then we can read our valid JSON:

val ok=spark.read.schema(simpleSchema).option("multiline", "true").json("/yourHDFSLocation/correctedJson/p*")  
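The simpleSchema used here isn't shown in the answer; a minimal sketch matching the sample fields (reading everything as strings) might be:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val simpleSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("Name", StringType),
  StructField("Sex", StringType),
  StructField("Destination", StringType)
))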

Output:

ok.show(false)
+------------+-------+---+------------+
|id          |Name   |Sex|Destination |
+------------+-------+---+------------+
|31342547689 |Jacob  |M  |Accounts    |
|987875637898|Martin |M  |Sr.Accounts |
|64542457879 |lucifer|M  |Developer   |
|23824723354 |Ratin  |M  |Sr.Developer|
+------------+-------+---+------------+

Another solution, if you don't want to save an intermediate file:

val rdd = sc.wholeTextFiles("/yourHDFSLocation/file1.json")
val rdd2 = rdd.map(s => "[" + s._2.replaceAll("\\}.*\n{0,}.*\\{", "},{") + "]")
val ok = spark.read.schema(simpleSchema).option("multiline", "true").json(rdd2)
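As a side note, on Spark 2.2+ the RDD[String] overload of spark.read.json is deprecated in favor of the Dataset[String] one; an equivalent sketch under the same assumptions as above:

import spark.implicits._

val ok = spark.read
  .schema(simpleSchema)
  .option("multiline", "true")
  .json(rdd2.toDS())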

Hope this resolves the problem!
