
Weird error while parsing JSON in Apache Spark

Trying to parse a JSON document, Spark gives me an error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
   (named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.buildReader(JsonFileFormat.scala:120)
...
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
at org.apache.spark.sql.Dataset.show(Dataset.scala:746)
at org.apache.spark.sql.Dataset.show(Dataset.scala:705)
at xxx.MyClass.xxx(MyClass.java:25)

I already tried opening the JSON doc in several online editors, and it is valid.

This is my code:

Dataset<Row> df = spark.read()
    .format("json")
    .load("file.json");

df.show(3); // this is line 25

I am using Java 8 and Spark 2.4.

The _corrupt_record column is where Spark stores records it could not parse during ingestion. That could be a hint.
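A minimal sketch of the workaround the exception message itself recommends (cache the parsed result before querying the corrupt-record column), assuming a SparkSession named spark and the same file.json:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Cache first, then query _corrupt_record -- querying only the
// internal corrupt-record column directly is what Spark 2.3+ rejects.
Dataset<Row> df = spark.read()
    .format("json")
    .load("file.json")
    .cache();

long corrupt = df.filter(df.col("_corrupt_record").isNotNull()).count();
System.out.println(corrupt + " corrupt record(s)");
```

If the count is greater than zero, Spark is treating your (valid) document as malformed, which usually points at the JSON Lines vs. multi-line issue below.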

Spark can process two types of JSON documents: JSON Lines and standard JSON (earlier versions of Spark could only read JSON Lines). You can find more in this Manning article.
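For reference, here are hypothetical examples of the two layouts. JSON Lines puts one complete object per physical line, which is what Spark expects by default; a standard pretty-printed document spans several lines, and without the multiline option Spark tries to parse each line as a record and routes everything to _corrupt_record:

```
JSON Lines (Spark's default expectation):
{"name": "Alice", "age": 30}
{"name": "Bob", "age": 25}

Standard JSON (needs the multiline option):
[
  {
    "name": "Alice",
    "age": 30
  },
  {
    "name": "Bob",
    "age": 25
  }
]
```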

You can try the multiline option, as in:

Dataset<Row> df = spark.read()
    .format("json")
    .option("multiline", true)
    .load("file.json");

to see if it helps. If not, share your JSON doc (if you can).

Set the multiline option to true. If it does not work, share your JSON.
