
What is the difference between saveAsObjectFile and persist in apache spark?

I am trying to compare Java and Kryo serialization. When saving the RDD to disk with saveAsObjectFile, the file size is the same either way, but on persist the Spark UI shows different sizes: the Kryo one is smaller than the Java one. The irony is that the processing time with Java is shorter than with Kryo, which I did not expect from the Spark UI. Why is that?

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

case class Person(name: String, age: Int)

val conf = new SparkConf()
  .setAppName("kryoExample")
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(
    Array(classOf[Person], classOf[Array[Person]])
  )

val sparkContext = new SparkContext(conf)
val personList: Array[Person] = (1 to 100000).map(value => Person(value + "", value)).toArray

val rddPerson: RDD[Person] = sparkContext.parallelize(personList)
val evenAgePerson: RDD[Person] = rddPerson.filter(_.age % 2 == 0)

evenAgePerson.saveAsObjectFile("src/main/resources/objectFile")
evenAgePerson.persist(StorageLevel.MEMORY_ONLY_SER)
evenAgePerson.count() // materializes the RDD and populates the cache

persist and saveAsObjectFile solve different needs.

persist has a misleading name. It is not meant to permanently persist the RDD result; it is used to temporarily persist the result of a computation within a Spark workflow. The user has no control over the location of the persisted data. persist is just caching with a configurable caching strategy: memory, disk, or both. In fact, cache simply calls persist with a default storage level.

For example:

val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()

// Counts errors mentioning MySQL
// Runs again on the full dataframe of all the lines, repeating the filter above
errors.filter(col("line").like("%MySQL%")).count()

vs

val errors = df.filter(col("line").like("%ERROR%"))
errors.persist()
// Counts all the errors
errors.count()

// Counts errors mentioning MySQL
// Runs only on the persisted errors result, which contains only the filtered error lines
errors.filter(col("line").like("%MySQL%")).count()
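
As noted above, cache is just persist with a default storage level. A minimal sketch of the equivalence, reusing the errors result from the example (note that the default differs between the RDD and Dataset APIs):

// cache() is shorthand for persist() with a default StorageLevel:
//   RDD.cache()     == persist(StorageLevel.MEMORY_ONLY)
//   Dataset.cache() == persist(StorageLevel.MEMORY_AND_DISK)
errors.cache()

// Release the cached blocks once they are no longer needed
errors.unpersist()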

saveAsObjectFile is for permanent persistence. It is used to serialize the final result of a Spark job onto a persistent, and usually distributed, filesystem like HDFS or Amazon S3.
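
For illustration, here is a minimal round trip; the read counterpart of saveAsObjectFile is SparkContext.objectFile, and the HDFS path below is hypothetical:

import org.apache.spark.rdd.RDD

// Write the final result to a persistent (usually distributed) filesystem ...
evenAgePerson.saveAsObjectFile("hdfs:///tmp/evenAgePerson")

// ... and read it back later, possibly from a different job
val restored: RDD[Person] = sparkContext.objectFile[Person]("hdfs:///tmp/evenAgePerson")
restored.count()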

Spark's persist and saveAsObjectFile are not the same at all.

persist - persists the result of your RDD's DAG at the requested StorageLevel; that means that from now on, any transformation applied to this RDD will be computed starting from the persisted data instead of re-evaluating the whole DAG.

saveAsObjectFile - just saves the RDD into a SequenceFile of serialized objects.
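
For reference, the StorageLevel passed to persist controls where and in what form the blocks are kept. A short sketch of common levels (rdd is a placeholder; a given RDD can only be assigned one level, so these lines are alternatives, not a sequence):

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_ONLY)      // deserialized objects in memory (RDD cache() default)
rdd.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized bytes in memory -- honors spark.serializer
rdd.persist(StorageLevel.MEMORY_AND_DISK)  // spills partitions to disk when memory runs out
rdd.persist(StorageLevel.DISK_ONLY)        // serialized bytes on executor-local disk only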

saveAsObjectFile doesn't use the "spark.serializer" configuration at all, as you can see in the code below:

  /**
   * Save this RDD as a SequenceFile of serialized objects.
   */
  def saveAsObjectFile(path: String): Unit = withScope {
    this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
      .saveAsSequenceFile(path)
  }

saveAsObjectFile uses Utils.serialize to serialize your objects, where the serialize method's definition is:

  /** Serialize an object using Java serialization */
  def serialize[T](o: T): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(o)
    oos.close()
    bos.toByteArray
  }

So saveAsObjectFile always uses Java serialization.

On the other hand, persist will use whatever spark.serializer you have configured.
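
This also explains the observation in the question: with Kryo configured, persist(StorageLevel.MEMORY_ONLY_SER) stores Kryo-serialized bytes (hence the smaller size in the Spark UI's Storage tab), while saveAsObjectFile always writes Java-serialized bytes, so the on-disk file size is the same with or without Kryo. A minimal sketch reusing the names from the question:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

// Kryo is configured as the serializer for in-memory serialized storage
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// Honors spark.serializer: blocks cached as Kryo-serialized bytes
evenAgePerson.persist(StorageLevel.MEMORY_ONLY_SER)

// Ignores spark.serializer: always written with Java serialization
evenAgePerson.saveAsObjectFile("src/main/resources/objectFile")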
