
Reading JSON Array with Apache Spark

I have a json array file, which looks like this:

["{\"timestamp\":1616549396892,\"id\":\"1\",\"events\":[{\"event_type\":\"ON\"}]}",{"meta":{"headers":{"app":"music"},"customerId":"2"}}]


I am trying to read this file in Scala through the spark-shell:

val s1 = spark.read.json("path/to/file/file.json")

However, this produces a DataFrame containing only a corrupt-record column:

org.apache.spark.sql.DataFrame = [_corrupt_record: string]

I have also tried reading it like this:

val df = spark.read.json(spark.sparkContext.wholeTextFiles("path.json").values)
val df = spark.read.option("multiline", "true").json("<file>")

But I still get the same _corrupt_record result.

I suspect I cannot read it because the JSON array mixes a plain string with JSON objects.

Can anyone shed some light on this error? Can the file be read with a Spark UDF?
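For reference, one way to keep the file as-is is to read each line as plain text and pick the array elements apart with `get_json_object`, which accepts JSONPath-style indices such as `$[0]`. This is only a sketch, assuming the whole array sits on a single line, that `spark` is already defined (as in spark-shell), and that the path and schema are placeholders taken from the sample record:

```scala
import org.apache.spark.sql.functions.{col, from_json, get_json_object}
import org.apache.spark.sql.types._

// schema of the escaped event string, assumed from the sample record
val eventSchema = new StructType()
  .add("timestamp", LongType)
  .add("id", StringType)
  .add("events", ArrayType(new StructType().add("event_type", StringType)))

// read each line as plain text instead of letting Spark parse the JSON
val raw = spark.read.text("path/to/file/file.json")

// $[0] is the escaped event string, $[1] the meta object;
// get_json_object returns both as plain strings
val parsed = raw.select(
  from_json(get_json_object(col("value"), "$[0]"), eventSchema).as("data"),
  get_json_object(col("value"), "$[1].meta.customerId").as("customerId"))

parsed.select(col("data.id"), col("data.events"), col("customerId")).show(false)
```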

Yes, the reason is the mix of a plain string and an actual JSON object in the same array. The two entries look like they belong together, though, so why not change the schema to something like this:

{"meta":{"headers": {"app": "music"},"customerId": "2"},"data": "{\"timestamp\":1616549396892,\"id\":\"1\",\"events\":[{\"event_type\":\"ON\"}]}"}

A new line also means a new record, so for multiple events your file would look like this:

{"meta":{"headers": {"app": "music"},"customerId": "2"},"data": "{\"timestamp\":1616549396892,\"id\":\"1\",\"events\":[{\"event_type\":\"ON\"}]}"}
{"meta":{"headers": {"app": "music"},"customerId": "2"},"data": "{\"timestamp\":1616549396892,\"id\":\"2\",\"events\":[{\"event_type\":\"ON\"}]}"}
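With the file restructured this way, plain `spark.read.json` handles the outer object, and the embedded `data` string can then be parsed with `from_json`. A minimal sketch, assuming spark-shell (where `spark` is already defined); the path is a placeholder and the schema is taken from the sample records:

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// schema of the escaped "data" string, assumed from the sample records
val eventSchema = new StructType()
  .add("timestamp", LongType)
  .add("id", StringType)
  .add("events", ArrayType(new StructType().add("event_type", StringType)))

// one JSON object per line, so the default reader works now
val df = spark.read.json("path/to/file/file.json")

// turn the escaped "data" string into a proper struct column
val parsed = df.withColumn("data", from_json(col("data"), eventSchema))

parsed.select(col("meta.customerId"), col("data.id"), col("data.events")).show(false)
```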
