I have a set of JSON objects that I want to read using Spark with Scala. A sample of the file data is below; one file contains more than 100 objects.
//file1
{
"id":"31342547689",
"Name":"Jacob",
"Sex":"M",
"Destination":"Accounts"
}
{
"id":"987875637898",
"Name":"Martin",
"Sex":"M",
"Destination":"Sr.Accounts"
}
{
"id":"64542457879",
"Name":"lucifer",
"Sex":"M",
"Destination":"Developer"
}
{
"id":"23824723354",
"Name":"Ratin",
"Sex":"M",
"Destination":"Sr.Developer"
}
When I use the code below, only the first object is printed.
val dataframe = spark
.read
.option("multiLine", true)
.schema(Schema)
.json("D:\\appdata\\file1")
.show()
You can read a pretty-printed JSON file by using Spark's wholeTextFiles API:
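import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import scala.collection.JavaConverters._
import spark.implicits._
// wholeTextFiles returns (path, content) pairs; keep only the file content
val input = spark.sparkContext.wholeTextFiles(inputFile).map(_._2)
val output = input.mapPartitions { contents =>
  // ObjectMapper is not serializable, so create one per partition on the executor
  val mapper = new ObjectMapper()
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
  mapper.registerModule(DefaultScalaModule)
  contents.flatMap { content =>
    try {
      // readValues walks every top-level JSON object in the file, not just the first one
      mapper.readerFor(classOf[DTOclass]).readValues[DTOclass](content).asScala.toList
    } catch {
      case e: Exception => Nil // skip files that cannot be parsed
    }
  }
}.toDS()
output.write.json(oplocation)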
You can also use the Gson library instead of ObjectMapper.
Note: DTOclass should be serializable.
Alternatively, you can clean up the pretty-printed JSON input and then read it with the spark.read.json API.
It seems your file is not valid JSON: it contains several objects one after another instead of a single JSON value. You can validate it at https://jsonlint.com/ . Ideally, it should look like this:
[{
"id":"31342547689",
"Name":"Jacob",
"Sex":"M",
"Destination":"Accounts"
},
{
"id":"987875637898",
"Name":"Martin",
"Sex":"M",
"Destination":"Sr.Accounts"
},
{
"id":"64542457879",
"Name":"lucifer",
"Sex":"M",
"Destination":"Developer"
},
{
"id":"23824723354",
"Name":"Ratin",
"Sex":"M",
"Destination":"Sr.Developer"
}]
So, first we need to preprocess this file to turn it into valid JSON.
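import spark.implicits._
// Read each file as a single (path, content) record
val df = sc.wholeTextFiles("/yourHDFSLocation/file1.json").toDF
// Join the objects with commas, wrap them in [ ], and save the result as text
df.select("_2")
  .map(s => "[" + s.mkString.replaceAll("\\}.*\n{0,}.*\\{", "},{") + "]")
  .repartition(1)
  .write.mode("overwrite")
  .text("/yourHDFSLocation/correctedJson/")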
Then we can read the valid JSON.
val ok=spark.read.schema(simpleSchema).option("multiline", "true").json("/yourHDFSLocation/correctedJson/p*")
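The simpleSchema passed to the reader is not defined in this answer. A minimal sketch, assuming all four fields are plain strings (they are quoted in the sample file), might look like this:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumed schema: the field names come from the sample file,
// and the types are a guess based on the values being quoted strings.
val simpleSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("Name", StringType, nullable = true),
  StructField("Sex", StringType, nullable = true),
  StructField("Destination", StringType, nullable = true)
))
```

If the real data uses other types, the schema would need to be adjusted accordingly.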
Output:
ok.show(false)
+------------+-------+---+------------+
|id          |Name   |Sex|Destination |
+------------+-------+---+------------+
|31342547689 |Jacob  |M  |Accounts    |
|987875637898|Martin |M  |Sr.Accounts |
|64542457879 |lucifer|M  |Developer   |
|23824723354 |Ratin  |M  |Sr.Developer|
+------------+-------+---+------------+
Another solution, if you don't want to save an intermediate file.
val rdd=sc.wholeTextFiles("/yourHDFSLocation/file1.json")
val rdd2=rdd.map(s=>"["+s._2.replaceAll("\\}.*\n{0,}.*\\{","},{")+"]")
val ok=spark.read.schema(simpleSchema).option("multiline", "true").json(rdd2)
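To see what the replaceAll call does without starting Spark, here is a small standalone sketch of the same string transformation. The object name ConcatenatedJson and the helper toJsonArray are hypothetical names introduced only for this illustration; the regex is the one used above.

```scala
object ConcatenatedJson {
  // Replaces the gap between a closing brace and the next opening brace
  // with "},{" and wraps the whole text in [ ] to form a JSON array.
  def toJsonArray(raw: String): String =
    "[" + raw.replaceAll("\\}.*\n{0,}.*\\{", "},{") + "]"

  def main(args: Array[String]): Unit = {
    // Two pretty-printed objects written one after another, as in file1
    val raw = "{\n\"id\":\"1\"\n}\n{\n\"id\":\"2\"\n}"
    println(toJsonArray(raw))
  }
}
```

Note that this pattern assumes flat objects separated by Unix line endings. A file containing nested objects, or a closing and an opening brace on the same line inside one object, would likely need a real JSON parser instead.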
Hope it resolves this problem!