
Reading JSON Array with Apache Spark

I have a json array file, which looks like this:

["{\"timestamp\":1616549396892,\"id\":\"1\",\"events\":[{\"event_type\":\"ON\"}]}",{"meta":{"headers":{"app":"music"},"customerId":"2"}}]


I am trying to read this file in Scala through the spark-shell:

val s1 = spark.read.json("path/to/file/file.json")

However, this produces a DataFrame containing only a corrupt-record column:

org.apache.spark.sql.DataFrame = [_corrupt_record: string]

I have also tried reading it like this:

val df = spark.read.json(spark.sparkContext.wholeTextFiles("path.json").values)
val df = spark.read.option("multiline", "true").json("<file>")

But I still get the same _corrupt_record result.

I suspect I cannot read it because the JSON array mixes a plain string with JSON objects.

Can anyone shed some light on this error? Can the file be read with a Spark UDF?
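For reference, one way to keep the file as-is is to read each line as plain text and pick the array elements apart with `get_json_object`, which accepts JSONPath-style indices such as `$[0]`. This is only a sketch, assuming the whole array sits on a single line, that `spark` is already defined (as in spark-shell), and that the path and schema are placeholders taken from the sample record:

```scala
import org.apache.spark.sql.functions.{col, from_json, get_json_object}
import org.apache.spark.sql.types._

// schema of the escaped event string, assumed from the sample record
val eventSchema = new StructType()
  .add("timestamp", LongType)
  .add("id", StringType)
  .add("events", ArrayType(new StructType().add("event_type", StringType)))

// read each line as plain text instead of letting Spark parse the JSON
val raw = spark.read.text("path/to/file/file.json")

// $[0] is the escaped event string, $[1] the meta object;
// get_json_object returns both as plain strings
val parsed = raw.select(
  from_json(get_json_object(col("value"), "$[0]"), eventSchema).as("data"),
  get_json_object(col("value"), "$[1].meta.customerId").as("customerId"))

parsed.select(col("data.id"), col("data.events"), col("customerId")).show(false)
```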

Yes, the reason is the mix of a plain string and an actual JSON object in the same array. The two entries look like they belong together, though, so why not change the schema to something like this:

{"meta":{"headers": {"app": "music"},"customerId": "2"},"data": "{\"timestamp\":1616549396892,\"id\":\"1\",\"events\":[{\"event_type\":\"ON\"}]}"}

A new line also means a new record, so for multiple events your file would look like this:

{"meta":{"headers": {"app": "music"},"customerId": "2"},"data": "{\"timestamp\":1616549396892,\"id\":\"1\",\"events\":[{\"event_type\":\"ON\"}]}"}
{"meta":{"headers": {"app": "music"},"customerId": "2"},"data": "{\"timestamp\":1616549396892,\"id\":\"2\",\"events\":[{\"event_type\":\"ON\"}]}"}
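With the file restructured this way, plain `spark.read.json` handles the outer object, and the embedded `data` string can then be parsed with `from_json`. A minimal sketch, assuming spark-shell (where `spark` is already defined); the path is a placeholder and the schema is taken from the sample records:

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// schema of the escaped "data" string, assumed from the sample records
val eventSchema = new StructType()
  .add("timestamp", LongType)
  .add("id", StringType)
  .add("events", ArrayType(new StructType().add("event_type", StringType)))

// one JSON object per line, so the default reader works now
val df = spark.read.json("path/to/file/file.json")

// turn the escaped "data" string into a proper struct column
val parsed = df.withColumn("data", from_json(col("data"), eventSchema))

parsed.select(col("meta.customerId"), col("data.id"), col("data.events")).show(false)
```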
