Spark Java Map function is getting executed twice

I have the code below as my Spark driver. When I execute the program it works properly, saving the required data as a Parquet file.

String indexFile = "index.txt";
JavaRDD<String> indexData = sc.textFile(indexFile).cache();

// Build a JSON string for each patient id read from the index file.
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
  @Override
  public String call(String patientId) throws Exception {
    return "json array as string";
  }
});

// 1. Read the JSON string RDD into a DataFrame (execution 1)
DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);
// 2. Save the DataFrame as a Parquet file (execution 2)
dataSchemaDF.write().parquet("md.parquet");

But I observed that my map function on the RDD indexData is executed twice: first when I read jsonStringRDD into a DataFrame using SQLContext, and second when I write dataSchemaDF to the Parquet file.

Can you guide me on how to avoid this repeated execution? Is there a better way of converting a JSON string into a DataFrame?

I believe the reason is the lack of a schema for the JSON reader. When you execute:

sqlContext.read().json(jsonStringRDD);

Spark has to infer the schema for the newly created DataFrame. To do that it has to scan the input RDD, and this step is performed eagerly.
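You can confirm this with a throwaway counter. The sketch below is illustrative only (the mapCalls accumulator is not part of the original code, and it reuses the sc, sqlContext, and indexData objects from the question): the counter is already greater than zero right after json() returns, before any write is issued.

import org.apache.spark.Accumulator;

// Illustration only: count how many times the map function actually runs.
final Accumulator<Integer> mapCalls = sc.accumulator(0);
JavaRDD<String> counted = indexData.map(new Function<String, String>() {
  @Override
  public String call(String patientId) throws Exception {
    mapCalls.add(1);                      // incremented on every invocation
    return "json array as string";
  }
});
sqlContext.read().json(counted);          // schema inference scans the RDD eagerly
System.out.println(mapCalls.value());     // already > 0: one full pass has happened here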

If you want to avoid it, you have to create a StructType which describes the shape of the JSON documents:

StructType schema;
...
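For example, a minimal sketch of such a schema, assuming hypothetical fields patientId, name, and age (replace them with the actual layout of your JSON documents):

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical fields -- adjust to match the real JSON documents.
StructType schema = DataTypes.createStructType(new StructField[] {
  DataTypes.createStructField("patientId", DataTypes.StringType, true),
  DataTypes.createStructField("name", DataTypes.StringType, true),
  DataTypes.createStructField("age", DataTypes.IntegerType, true)
});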

and use it when you create the DataFrame:

DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
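With the schema supplied up front there is no inference pass over jsonStringRDD, so the map function should run only once, when dataSchemaDF.write().parquet("md.parquet") is executed. Note that fields present in the JSON but not listed in the schema are simply dropped, so the schema has to cover everything you want to keep.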
