简体   繁体   中英

Weird error while parsing JSON in Apache Spark

Trying to parse a JSON document and Spark gives me an error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
   (named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.buildReader(JsonFileFormat.scala:120)
...
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
at org.apache.spark.sql.Dataset.show(Dataset.scala:746)
at org.apache.spark.sql.Dataset.show(Dataset.scala:705)
at xxx.MyClass.xxx(MyClass.java:25)

I already tried to open the JSON doc in several online editors and it's valid.

This is my code:

Dataset<Row> df = spark.read()
    .format("json")
    .load("file.json");

df.show(3); // this is line 25

I am using Java 8 and Spark 2.4.

The _corrupt_record column is where Spark stores malformed records when it tries to ingest them. That could be a hint.

Spark also process two types of JSON documents, JSON Lines and normal JSON (in the earlier versions Spark could only do JSON Lines). You can find more in this Manning article .

You can try the multiline option, as in:

Dataset<Row> df = spark.read()
    .format("json")
    .option("multiline", true)
    .load("file.json");

to see if it helps. If not, share your JSON doc (if you can).

set the multiline option to true. If it does not work share your json

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM