Reading massive JSON files into Spark Dataframe

I have a large nested NDJSON (newline-delimited JSON) file that I need to read into a single Spark dataframe and save to Parquet. In an attempt to render the schema, I use this function:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// recursively walk the schema, producing one Column per leaf field
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenSchema(st, colName) // descend into nested structs
      case _ => Array(col(colName))
    }
  })
}

applied to the dataframe returned by

val df = sqlCtx.read.json(sparkContext.wholeTextFiles(path).values)
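For context, the flattening helper would then be applied along these lines (flatDf is just an illustrative name):

val flatDf = df.select(flattenSchema(df.schema): _*)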

I've also switched this to val df = spark.read.json(path), which only handles NDJSON and not multi-line JSON, and got the same error.

This causes an out-of-memory error on the workers: java.lang.OutOfMemoryError: Java heap space.

I've altered the JVM memory options and the Spark executor/driver options, to no avail.

Is there a way to stream the file, flatten the schema, and add to a dataframe incrementally? Some lines of the JSON contain new fields not present in the preceding entries...so those would need to be filled in later.

No workaround. The issue was with the JVM object limit. I ended up using a Scala JSON parser and built the dataframe manually.
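That approach would look roughly like the sketch below. It is only a sketch: it assumes json4s as the parser, and the target schema (fields id and name) is made up for illustration. Reading with textFile keeps one line, i.e. one JSON document, per record instead of materializing whole files the way wholeTextFiles does.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

val spark = SparkSession.builder().getOrCreate()

// declare the target schema up front instead of letting Spark infer it
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)
))

// one NDJSON document per line; fields missing from a line become null
val rows = spark.sparkContext.textFile(path).map { line =>
  val jv = parse(line)
  val id = (jv \ "id") match { case JInt(v) => v.toLong; case _ => null }
  val name = (jv \ "name") match { case JString(s) => s; case _ => null }
  Row(id, name)
}

val df = spark.createDataFrame(rows, schema)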

You can achieve this in multiple ways.

First, while reading, you can provide a schema for the dataframe to read the JSON with, or you can allow Spark to infer the schema by itself.
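For example, a schema supplied up front (the field names here are illustrative) lets Spark skip the inference pass over the data:

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("payload", StructType(Seq(
    StructField("name", StringType)
  )))
))

// with an explicit schema, spark.read does not scan the data to infer one
val df = spark.read.schema(schema).json(path)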

Once the JSON is in a dataframe, you can flatten it in the following ways.

a. Use explode() on the dataframe to flatten it.
b. Use Spark SQL and access the nested fields with the . operator. You can find examples here.
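A minimal sketch of both, assuming illustrative columns: an array column items and a struct column payload:

import org.apache.spark.sql.functions.{col, explode}

// a. explode() produces one output row per element of the array column
val exploded = df.withColumn("item", explode(col("items")))

// b. nested struct fields are reached with the . operator, in the DSL or in SQL
val names = df.select(col("payload.name"))
df.createOrReplaceTempView("records")
val sqlNames = spark.sql("SELECT payload.name FROM records")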

Lastly, if you want to add new columns to the dataframe:

a. Using withColumn() is one approach. However, this runs once per new column added, over the entire data set.
b. Using SQL to generate a new dataframe from the existing one - this may be the easiest (a one-line sketch follows below).
c. Using map, then accessing the elements, getting the old schema, adding the new values, creating a new schema, and finally getting the new df - as below.
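Option b can be a one-liner over a temp view (newCol and the expression are made up for illustration):

// register the existing dataframe and project the new column in one SQL pass
mainDataFrame.createOrReplaceTempView("main")
val withNewCol = spark.sql("SELECT *, oldCol1 + oldCol2 AS newCol FROM main")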

A single withColumn call operates over the entire RDD, so it is generally not good practice to call that method for every column you want to add. Instead, there is a way to work with the columns and their data inside one map function. Since a single map function does the job, the code that adds the new columns and their data runs in parallel.

a. You can gather the new values based on your calculations.

b. Add these new column values to the main RDD as below:

val newColumns: Seq[Any] = Seq(newcol1, newcol2)
// .init drops the last existing field before the new column values are appended
Row.fromSeq(row.toSeq.init ++ newColumns)

Here, row is the reference to the current row inside the map method.

c. Create the new schema as below:

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
val newColumnsStructType = StructType(Seq(StructField("newColName1", IntegerType), StructField("newColName2", IntegerType)))

d. Add it to the old schema:

// schema.init mirrors row.toSeq.init above: the last original field is dropped
val newSchema = StructType(mainDataFrame.schema.init ++ newColumnsStructType)

e. Create the new dataframe with the new columns:

val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)
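Putting steps a through e together, a hedged end-to-end sketch (the new column values are constants purely for illustration; as in the snippets above, .init drops the last original field):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// a + b: one map over the data computes and appends the new values in parallel
val newRDD = mainDataFrame.rdd.map { row =>
  val newColumns: Seq[Any] = Seq(1, 2) // stand-ins for real calculations
  Row.fromSeq(row.toSeq.init ++ newColumns)
}

// c: schema for the added columns
val newColumnsStructType = StructType(Seq(
  StructField("newColName1", IntegerType),
  StructField("newColName2", IntegerType)
))

// d: splice onto the old schema, dropping the last original field to match the rows
val newSchema = StructType(mainDataFrame.schema.init ++ newColumnsStructType)

// e: the dataframe with the new columns
val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)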
