
Reading JSON with Apache Spark - `corrupt_record`

I have a JSON file, nodes, that looks like this:

[{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
,{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
,{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
,{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}]

I am able to read and manipulate this record with Python.

I am trying to read this file in Scala through the spark-shell.

From this tutorial, I can see that it is possible to read JSON via sqlContext.read.json:

val vfile = sqlContext.read.json("path/to/file/nodes.json")

However, this results in a `corrupt_record` error:

vfile: org.apache.spark.sql.DataFrame = [_corrupt_record: string]

Can anyone shed some light on this error? I can read and use the file with other applications, and I am confident it is not corrupt and is sound JSON.

Spark cannot read a JSON array into records at the top level, so you have to pass:

{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1} 
{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2} 
{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3} 
{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}
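If you don't want to reshape the file by hand, one option is to let Spark do the conversion once. This is only a sketch and assumes Spark 2.2+, where the multiLine reader option (also mentioned in a later answer) is available; nodes_jsonl is just an illustrative output path:

// One-time conversion sketch (assumes Spark 2.2+ for the "multiLine" option)
val arrayDf = spark.read
  .option("multiLine", "true")
  .json("path/to/file/nodes.json")        // the original top-level JSON array

// Writes a directory of part files with one JSON object per line
arrayDf.write.json("path/to/file/nodes_jsonl")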

As described in the tutorial you're referring to:

Let's begin by loading a JSON file, where each line is a JSON object

The reasoning is quite simple: Spark expects you to pass a file with many JSON entities (one entity per line), so that it can distribute their processing (roughly speaking, per entity).

To shed more light on it, here is a quote from the official docs:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

This format is called JSONL (JSON Lines). Basically, it's an alternative to CSV.
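As a quick sanity check, reading the reshaped one-object-per-line data with the default reader should then give one row per object. Sketch only: path/to/file/nodes_jsonl stands for wherever the line-delimited data ended up, and the schema shown is what I would expect for the sample data, not captured output:

val nodes = sqlContext.read.json("path/to/file/nodes_jsonl")
nodes.printSchema()

root
 |-- index: long (nullable = true)
 |-- point: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- toid: string (nullable = true)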

Since Spark expects "JSON Lines format" rather than typical JSON, we can tell Spark to read typical JSON by specifying:

val df = spark.read.option("multiline", "true").json("<file>")
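Applied to the nodes.json file from the question, a fuller sketch looks like this (the option requires Spark 2.2 or later; the documentation spells it multiLine):

val nodesDf = spark.read
  .option("multiLine", "true")
  .json("path/to/file/nodes.json")   // the original top-level JSON array

nodesDf.show()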

To read the multi-line JSON as a DataFrame:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// wholeTextFiles yields (path, content) pairs; .values keeps just the file content
val df = spark.read.json(spark.sparkContext.wholeTextFiles("file.json").values)
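Note that spark.read.json(RDD[String]) is deprecated in newer Spark versions (2.2+); a variant sketch for those versions wraps the strings in a Dataset[String] first (same hypothetical file.json path):

import spark.implicits._   // brings .toDS() into scope for the RDD

val df = spark.read.json(
  spark.sparkContext.wholeTextFiles("file.json").values.toDS()
)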

Reading large files in this manner is not recommended; from the wholeTextFiles docs:

Small files are preferred, large file is also allowable, but may cause bad performance.

I ran into the same problem. I used sparkContext and sparkSql with the same configuration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setMaster("local[1]")
  .setAppName("Simple Application")

val sc = new SparkContext(conf)

val spark = SparkSession
  .builder()
  .config(conf)
  .getOrCreate()

Then, using the Spark context, I read the whole JSON file (JSON here is the path to the file):

// wholeTextFiles returns (path, content) pairs; keep only the content
val jsonRDD = sc.wholeTextFiles(JSON).map(x => x._2)

You can create a schema for future selects, filters, and so on:

import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("toid", StringType, nullable = true),
  StructField("point", ArrayType(DoubleType), nullable = true),
  StructField("index", DoubleType, nullable = true)
))

Create a DataFrame using Spark SQL:

// json() already returns a DataFrame, so a trailing toDF() is unnecessary
val df = spark.read.schema(schema).json(jsonRDD)

For testing, use show and printSchema:

df.show()
df.printSchema()
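With the explicit schema above, printSchema should simply echo the declared types, roughly (expected output, not captured from a run):

root
 |-- toid: string (nullable = true)
 |-- point: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- index: double (nullable = true)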

sbt build file:

name := "spark-single"

version := "1.0"

scalaVersion := "2.11.7"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.2"
