Parse JSON Object in spark scala

I have a set of JSON objects, and I want to read that data using Spark Scala. I'll put sample file data below; one file contains more than 100 objects.

//file1

{
  "id":"31342547689",
  "Name":"Jacob",
  "Sex":"M",
  "Destination":"Accounts"
}
{
  "id":"987875637898",
  "Name":"Martin",
  "Sex":"M",
  "Destination":"Sr.Accounts"
}
{
  "id":"64542457879",
  "Name":"lucifer",
  "Sex":"M",
  "Destination":"Developer"
}
{
  "id":"23824723354",
  "Name":"Ratin",
  "Sex":"M",
  "Destination":"Sr.Developer"
}

When I use the code below, I am only able to print the first object.

val dataframe = spark
      .read
      .option("multiLine", true)
      .schema(Schema)
      .json("D:\\appdata\\file1")
dataframe.show()

You can read the pretty-printed JSON file by using the Spark wholeTextFiles API:

import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.{DefaultScalaModule, ScalaObjectMapper}
import spark.implicits._

val input = spark.sparkContext.wholeTextFiles(inputFile).map(_._2)

val output = input.mapPartitions { records =>
    // ObjectMapper is not serializable, so create one mapper per partition
    // on each executor instead of shipping an instance from the driver
    val mapper = new ObjectMapper with ScalaObjectMapper
    mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
    mapper.registerModule(DefaultScalaModule)
    records.map { record =>
        try {
            mapper.readValue(record, classOf[DTOclass])
        } catch {
            case e: Exception => null
        }
    }
}.filter(_ != null).toDF.as[DTOclass]

output.write.json(oplocation)

You can also use the Gson library instead of ObjectMapper, as sketched below.
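A minimal sketch of the same per-partition pattern with Gson, assuming the gson artifact is on the classpath and the same hypothetical DTOclass as above:

import com.google.gson.Gson

val output = input.mapPartitions { records =>
    // Gson is not serializable either, so build one instance per partition
    val gson = new Gson()
    records.flatMap { record =>
        try Some(gson.fromJson(record, classOf[DTOclass]))
        catch { case _: Exception => None } // skip malformed records
    }
}.toDF.as[DTOclass]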

Note: DTOclass should be serializable.
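For reference, a hypothetical DTOclass matching the sample records could be a case class, which is serializable by default:

case class DTOclass(id: String, Name: String, Sex: String, Destination: String)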

Or you can clean up your pretty-printed input JSON file and read it with the spark.read.json API, as mentioned here.

It seems your file doesn't contain valid JSON. You can validate it at https://jsonlint.com/. Ideally, it should look like this:

[{
  "id":"31342547689",
  "Name":"Jacob",
  "Sex":"M",
  "Destination":"Accounts"
},
{
  "id":"987875637898",
  "Name":"Martin",
  "Sex":"M",
  "Destination":"Sr.Accounts"
},
{
  "id":"64542457879",
  "Name":"lucifer",
  "Sex":"M",
  "Destination":"Developer"
},
{
  "id":"23824723354",
  "Name":"Ratin",
  "Sex":"M",
  "Destination":"Sr.Developer"
}]

So, first we need to preprocess this file to turn it into valid JSON.

// Read each file as a single string, rewrite the "} ... {" boundaries
// between objects as "},{", and wrap the result in [ ] to form a JSON array
val df = sc.wholeTextFiles("/yourHDFSLocation/file1.json").toDF
df.select("_2")
  .map(s => "[" + s.mkString.replaceAll("\\}.*\n{0,}.*\\{", "},{") + "]")
  .repartition(1)
  .write.mode("overwrite")
  .text("/yourHDFSLocation/correctedJson/")

Then we can read our valid JSON:

val ok=spark.read.schema(simpleSchema).option("multiline", "true").json("/yourHDFSLocation/correctedJson/p*")  
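The simpleSchema used here isn't shown in the answer; a minimal sketch matching the sample fields (reading everything as strings) might be:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val simpleSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("Name", StringType),
  StructField("Sex", StringType),
  StructField("Destination", StringType)
))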

Output:

ok.show(false)
+------------+-------+---+------------+
|id          |Name   |Sex|Destination |
+------------+-------+---+------------+
|31342547689 |Jacob  |M  |Accounts    |
|987875637898|Martin |M  |Sr.Accounts |
|64542457879 |lucifer|M  |Developer   |
|23824723354 |Ratin  |M  |Sr.Developer|
+------------+-------+---+------------+

Another solution, if you don't want to save an intermediate file:

val rdd = sc.wholeTextFiles("/yourHDFSLocation/file1.json")
val rdd2 = rdd.map(s => "[" + s._2.replaceAll("\\}.*\n{0,}.*\\{", "},{") + "]")
val ok = spark.read.schema(simpleSchema).option("multiline", "true").json(rdd2)
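As a side note, on Spark 2.2+ the RDD[String] overload of spark.read.json is deprecated in favor of the Dataset[String] one; an equivalent sketch under the same assumptions as above:

import spark.implicits._

val ok = spark.read
  .schema(simpleSchema)
  .option("multiline", "true")
  .json(rdd2.toDS())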

Hope this resolves the problem!
