
In Spark is it good practice to persist after an action?

Given this example:

    val someRDD = firstRDD.flatMap{ case(x,y) => SomeFunc(y)}
    val oneRDD = someRDD.reduceByKey(_+_)
    oneRDD.saveAsNewAPIHadoopFile("dir/to/write/to", classOf[Text], classOf[Text], classOf[TextOutputFormat[Text, Text]])

Which would be better to do?

    val someRDD = firstRDD.flatMap{ case(x,y) => SomeFunc(y)}.persist(storage.StorageLevel.MEMORY_AND_DISK_SER)
    val oneRDD = someRDD.reduceByKey(_+_)
    oneRDD.saveAsNewAPIHadoopFile("dir/to/write/to", classOf[Text], classOf[Text], classOf[TextOutputFormat[Text, Text]])

OR

    val someRDD = firstRDD.flatMap{ case(x,y) => SomeFunc(y)}.persist(storage.StorageLevel.MEMORY_AND_DISK_SER)
    val oneRDD = someRDD.reduceByKey(_+_).persist(storage.StorageLevel.MEMORY_AND_DISK_SER)
    oneRDD.saveAsNewAPIHadoopFile("dir/to/write/to", classOf[Text], classOf[Text], classOf[TextOutputFormat[Text, Text]])

or something else?

I see that it is good to persist when you are performing more than one action on the same RDD.

For example:

    val newRDD = context.parallelize(0 until numMappers, numPartitions).persist(storage.StorageLevel.MEMORY_AND_DISK_SER) // persisted because two follow-on actions are performed on it
    newRDD.count() // same RDD
    newRDD.saveAsNewAPIHadoopFile() // same RDD (arguments omitted)
    // ...other actions etc.

Here there is only one RDD with two actions in a row. Should I persist at all?

From Spark documentation:

Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.

(I added bold around the above statement)
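As a rough illustration of that recommendation, here is a minimal sketch that persists the result of reduceByKey because two actions follow it. The object name, the `local[2]` master, the app name, and the tiny in-memory `firstRDD` are all assumptions made for the example, and splitting on spaces stands in for the question's `SomeFunc`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistReusedShuffleResult {
  /** Persists the reduceByKey result, runs two actions on it, and returns both results. */
  def run(sc: SparkContext): (Long, scala.collection.Map[String, Int]) = {
    // Hypothetical stand-in for firstRDD in the question
    val firstRDD = sc.parallelize(Seq(("a", "x y"), ("b", "y z")))

    val oneRDD = firstRDD
      .flatMap { case (_, y) => y.split(" ").map(word => (word, 1)) } // stand-in for SomeFunc
      .reduceByKey(_ + _)
      .persist(StorageLevel.MEMORY_AND_DISK_SER) // reused below, so persisting can pay off

    val distinctKeys = oneRDD.count()   // first action: computes and stores the partitions
    val totals = oneRDD.collectAsMap()  // second action: can read the stored partitions
    oneRDD.unpersist()                  // release the storage once the RDD is no longer needed
    (distinctKeys, totals)
  }

  def main(args: Array[String]): Unit = {
    // Master and app name are illustrative assumptions
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-reuse-sketch")
    val sc = new SparkContext(conf)
    try {
      val (distinctKeys, totals) = run(sc)
      println(s"keys=$distinctKeys totals=$totals")
    } finally sc.stop()
  }
}
```

If only a single action (such as one save) follows the reduceByKey, the persist call in this sketch would add storage overhead without any reuse to benefit from.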

Note that chaining transformations is fine. The performance problem occurs when reusing an RDD without persisting it, because each action then recomputes the RDD from its lineage.
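One way to observe the cost of reuse is to count how often a transformation's function runs. The sketch below is an assumption-laden illustration, not part of the original post: the object name, the `mapCalls` helper, the data size, and the local master are all made up for the example. It uses a `LongAccumulator` inside `map`; Spark does not guarantee exactly-once accumulator updates inside transformations, so treat the counts as indicative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RecomputationSketch {
  /** Runs two actions on a mapped RDD and returns how many times the map function ran. */
  def mapCalls(sc: SparkContext, persistFirst: Boolean): Long = {
    val calls = sc.longAccumulator("mapCalls")
    val mapped = sc.parallelize(1 to 100, 4).map { x => calls.add(1); x * 2 }
    if (persistFirst) mapped.persist(StorageLevel.MEMORY_AND_DISK)

    mapped.count()        // first action
    mapped.reduce(_ + _)  // second action on the same RDD

    if (persistFirst) mapped.unpersist()
    calls.sum
  }

  def main(args: Array[String]): Unit = {
    // Master and app name are illustrative assumptions
    val conf = new SparkConf().setMaster("local[2]").setAppName("recomputation-sketch")
    val sc = new SparkContext(conf)
    try {
      println(s"map calls without persist: ${mapCalls(sc, persistFirst = false)}")
      println(s"map calls with persist:    ${mapCalls(sc, persistFirst = true)}")
    } finally sc.stop()
  }
}
```

In a local run with no task retries, the unpersisted variant would be expected to invoke the map function about twice as often as the persisted one, since the second action replays the lineage instead of reading stored partitions.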
