
Reading JSON Array with Apache Spark

I have a JSON array file, which looks like this:

["{\"timestamp\":1616549396892,\"id\":\"1\",\"events\":[{\"event_type\":\"ON\"}]}",{"meta":{"headers":{"app":"music"},"customerId":"2"}}]


I am trying to read this file in Scala through the spark-shell:

val s1 = spark.read.json("path/to/file/file.json")

However, this results in a corrupt-record DataFrame:

org.apache.spark.sql.DataFrame = [_corrupt_record: string]

I have also tried reading it like this:

val df = spark.read.json(spark.sparkContext.wholeTextFiles("path.json").values)
val df = spark.read.option("multiline", "true").json("<file>")

But I still get the same error.

Since the JSON array contains both strings and JSON objects, maybe that is why I am not able to read it.

Can anyone shed some light on this error? How can we read it via a Spark UDF?

Yes, the reason is the mix of plain text and actual JSON objects. To me it looks as if the two entries belong together, so why not change the schema to something like this:

{"meta":{"headers": {"app": "music"},"customerId": "2"},"data": "{\"timestamp\":1616549396892,\"id\":\"1\",\"events\":[{\"event_type\":\"ON\"}]}"}

A new line also means a new record, so for multiple events your file would look like this:

{"meta":{"headers": {"app": "music"},"customerId": "2"},"data": "{\"timestamp\":1616549396892,\"id\":\"1\",\"events\":[{\"event_type\":\"ON\"}]}"}
{"meta":{"headers": {"app": "music"},"customerId": "2"},"data": "{\"timestamp\":1616549396892,\"id\":\"2\",\"events\":[{\"event_type\":\"ON\"}]}"}
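The restructuring above can be sketched as a small pre-processing step. Here is a minimal sketch using Python's standard `json` module (not the original poster's code; the pairing of the two array entries into one record is the assumption made above):

```python
import json

# Original mixed array: a JSON-encoded string followed by a plain object
# (assumed to belong together, as suggested above).
raw = '["{\\"timestamp\\":1616549396892,\\"id\\":\\"1\\",\\"events\\":[{\\"event_type\\":\\"ON\\"}]}",{"meta":{"headers":{"app":"music"},"customerId":"2"}}]'

entries = json.loads(raw)
data_str = entries[0]   # the stringified event payload stays a string
meta_obj = entries[1]   # the plain "meta" object

# Merge into one record: meta fields plus the payload kept as a string.
record = {"meta": meta_obj["meta"], "data": data_str}
line = json.dumps(record)
print(line)
```

With the file rewritten this way (one such line per record), `spark.read.json` can infer `meta` as a struct and `data` as a plain string column, and the nested payload can then be decoded with `from_json` and an explicit schema.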

