
Spark checkpointing behaviour

Does Spark use checkpoints when we start a new job? Let's say we used a checkpoint to write some RDD to a disk. Will the said RDD be recalculated or loaded from the disk during a new job?

In addition to the points given by @maxime G...

Spark does not checkpoint by default; you need to enable it explicitly.

Checkpointing is a feature of Spark Core (which Spark SQL uses for distributed computations) that allows a driver to be restarted on failure with the previously computed state of a distributed computation described as an RDD.

Spark offers two varieties of checkpointing.

Reliable checkpointing: this uses reliable data storage such as Hadoop HDFS or S3. You can enable it with:

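// point Spark at reliable storage first (an s3:// location works the same way)
sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint/")
// checkpoint returns a new DataFrame; use the returned value from here on
val checkpointed = dataframe.checkpoint(eager = true)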
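As a fuller illustration of the reliable variant, here is a minimal sketch. It assumes a local SparkSession and uses a temporary local directory as a stand-in for an HDFS or S3 path; the object name, column names and data are made up for the example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ReliableCheckpointSketch {
  def run(spark: SparkSession): DataFrame = {
    // Assumption: a local temp directory stands in for hdfs:// or s3:// storage.
    val dir = Files.createTempDirectory("spark-checkpoint").toString
    spark.sparkContext.setCheckpointDir(dir)

    // Hypothetical data: 1000 ids plus a derived column.
    val df = spark.range(0, 1000).toDF("id")
      .withColumn("doubled", col("id") * 2)

    // checkpoint returns a new DataFrame backed by the checkpoint files;
    // the original df keeps its full plan, so continue with the returned value.
    df.checkpoint(eager = true)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("reliable-checkpoint-sketch")
      .getOrCreate()
    try {
      val checkpointed = run(spark)
      println(checkpointed.count())
    } finally spark.stop()
  }
}
```

On a real cluster the directory should live on storage that every executor can reach, which is why HDFS or S3 is the usual choice.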

Non-reliable (local) checkpointing: this uses executor storage (i.e. node-local disk storage) to write checkpoint files. Because the data is tied to the executor lifecycle, it is considered unreliable: it is not guaranteed to be available if the job terminates abruptly.

// localCheckpoint keeps its data in executor storage;
// a directory set with setCheckpointDir is not used here
val checkpointed = dataframe.localCheckpoint(eager = true)
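A minimal sketch of the local variant, again assuming a local SparkSession; the object name and the sample data are hypothetical. Note that no checkpoint directory is configured.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LocalCheckpointSketch {
  def run(spark: SparkSession): DataFrame = {
    // Hypothetical data: the even ids below 1000.
    val df = spark.range(0, 1000).toDF("id").filter("id % 2 = 0")

    // No setCheckpointDir call: local checkpoints live in executor storage,
    // so they are lost if the executors holding them go away.
    df.localCheckpoint(eager = true)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("local-checkpoint-sketch")
      .getOrCreate()
    try {
      println(run(spark).count())
    } finally spark.stop()
  }
}
```

This trades fault tolerance for speed, since nothing is written to a distributed file system.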

(Be careful when you use local checkpointing with cluster autoscaling enabled: executors that are removed take their checkpoint data with them.)

Note: Checkpointing can be eager or lazy, depending on the eager flag of the checkpoint operator. Eager checkpointing is the default and happens immediately when requested. Lazy checkpointing is deferred and only happens when an action is executed. An eager checkpoint creates an immediate stage barrier, whereas a lazy one waits for an action and, until then, remembers all previous transformations.
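The lazy behaviour is easiest to observe on the RDD API, where checkpoint() takes no eager flag and only marks the RDD until an action runs. A minimal sketch, assuming a local SparkSession and a temporary directory as the checkpoint location (the object name is hypothetical):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object LazyCheckpointSketch {
  // Returns isCheckpointed as seen before and after the first action.
  def run(spark: SparkSession): (Boolean, Boolean) = {
    val sc = spark.sparkContext
    // Assumption: a local temp directory stands in for reliable storage.
    sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)

    val rdd = sc.parallelize(1 to 10).map(_ * 2)
    rdd.checkpoint()                // lazy: only marks the RDD for checkpointing
    val before = rdd.isCheckpointed // nothing has been written yet
    rdd.count()                     // the first action triggers the write
    val after = rdd.isCheckpointed
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lazy-checkpoint-sketch")
      .getOrCreate()
    try {
      println(run(spark))
    } finally spark.stop()
  }
}
```

For a DataFrame, dataframe.checkpoint(eager = false) defers the write in a similar way until an action runs on the returned DataFrame.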

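At the start of a job, if an RDD is already present in your checkpoint location, it is loaded from there rather than recomputed. For RDD and DataFrame checkpoints this applies to later jobs in the same application; a separately started application does not pick the files up on its own.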

That also means that if you change the code, you should be careful about checkpoints that are recovered after a restart (as streaming applications do): state written by the old code gets loaded by the new code, and that can cause conflicts.
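To check whether a later job in the same application recomputes a checkpointed RDD, one option is to count how often a transformation runs. The sketch below does that with an accumulator; it assumes a local SparkSession and a temporary checkpoint directory, and the object and accumulator names are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CheckpointReuseSketch {
  // Returns the number of map calls seen after the first and second job.
  def run(spark: SparkSession): (Long, Long) = {
    val sc = spark.sparkContext
    // Assumption: a local temp directory stands in for reliable storage.
    sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)

    val calls = sc.longAccumulator("map calls")
    val rdd = sc.parallelize(1 to 100).map { x => calls.add(1L); x * 2 }

    rdd.checkpoint()
    rdd.count()                  // first job: computes the RDD and writes the checkpoint
    val afterFirst = calls.sum

    rdd.count()                  // second job: reads the checkpoint files instead
    val afterSecond = calls.sum
    (afterFirst, afterSecond)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("checkpoint-reuse-sketch")
      .getOrCreate()
    try {
      val (first, second) = run(spark)
      println(s"map calls after job 1: $first, after job 2: $second")
    } finally spark.stop()
  }
}
```

If the two numbers are equal, the second job did not rerun the map function. Accumulator updates inside transformations can be repeated when tasks are retried, so treat this as a diagnostic for a quiet local run rather than an exact counter.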


 