Spark Java Map function is getting executed twice

I have the code below as my Spark driver. When I execute the program it works properly, saving the required data as a Parquet file.

String indexFile = "index.txt";
JavaRDD<String> indexData = sc.textFile(indexFile).cache();

// Build a JSON string for each patient id read from the index file.
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
  @Override
  public String call(String patientId) throws Exception {
    return "json array as string";
  }
});

// 1. Read the JSON string RDD into a DataFrame (execution 1)
DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);
// 2. Save the DataFrame as a Parquet file (execution 2)
dataSchemaDF.write().parquet("md.parquet");

But I observed that my map function on the RDD indexData is executed twice: first when I read jsonStringRDD into a DataFrame using SQLContext, and second when I write dataSchemaDF to the Parquet file.

Can you guide me on how to avoid this repeated execution? Is there a better way of converting a JSON string into a DataFrame?

I believe the reason is the lack of a schema for the JSON reader. When you execute:

sqlContext.read().json(jsonStringRDD);

Spark has to infer the schema for the newly created DataFrame. To do that it has to scan the input RDD, and this step is performed eagerly.
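You can confirm this with a throwaway counter. The sketch below is illustrative only (the mapCalls accumulator is not part of the original code, and it reuses the sc, sqlContext, and indexData objects from the question): the counter is already greater than zero right after json() returns, before any write is issued.

import org.apache.spark.Accumulator;

// Illustration only: count how many times the map function actually runs.
final Accumulator<Integer> mapCalls = sc.accumulator(0);
JavaRDD<String> counted = indexData.map(new Function<String, String>() {
  @Override
  public String call(String patientId) throws Exception {
    mapCalls.add(1);                      // incremented on every invocation
    return "json array as string";
  }
});
sqlContext.read().json(counted);          // schema inference scans the RDD eagerly
System.out.println(mapCalls.value());     // already > 0: one full pass has happened here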

If you want to avoid it, you have to create a StructType which describes the shape of the JSON documents:

StructType schema;
...
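For example, a minimal sketch of such a schema, assuming hypothetical fields patientId, name, and age (replace them with the actual layout of your JSON documents):

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical fields -- adjust to match the real JSON documents.
StructType schema = DataTypes.createStructType(new StructField[] {
  DataTypes.createStructField("patientId", DataTypes.StringType, true),
  DataTypes.createStructField("name", DataTypes.StringType, true),
  DataTypes.createStructField("age", DataTypes.IntegerType, true)
});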

and use it when you create the DataFrame:

DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
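With the schema supplied up front there is no inference pass over jsonStringRDD, so the map function should run only once, when dataSchemaDF.write().parquet("md.parquet") is executed. Note that fields present in the JSON but not listed in the schema are simply dropped, so the schema has to cover everything you want to keep.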
