Is there a way to change the replication factor of RDDs in Spark?

From what I understand, there are multiple copies of an RDD's data in the cluster, so that the program can recover if a node fails. However, in cases where the chance of failure is negligible, keeping multiple copies of the data in RDDs would be costly memory-wise. So my question is: is there a parameter in Spark that can be used to reduce the replication factor of RDDs?

First, note that Spark does not automatically cache all your RDDs, simply because applications may create many RDDs, and not all of them will be reused. You have to call .persist() or .cache() on them.

You can set the storage level when persisting an RDD, e.g. myRDD.persist(StorageLevel.MEMORY_AND_DISK). .cache() is a shorthand for .persist(StorageLevel.MEMORY_ONLY).
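For instance, a minimal sketch in Scala, assuming an existing SparkContext named sc (as you would have inside spark-shell):

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc`.
val myRDD = sc.parallelize(1 to 1000)

// Keep a single deserialized copy in memory -- equivalent to myRDD.cache():
myRDD.persist(StorageLevel.MEMORY_ONLY)

// Alternatively, spill partitions to disk when they don't fit in memory
// (an RDD's storage level can only be set once, hence commented out here):
// myRDD.persist(StorageLevel.MEMORY_AND_DISK)
```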

The default storage level for persist is indeed StorageLevel.MEMORY_ONLY for an RDD in Java or Scala – but usually differs if you are creating a DStream (refer to your DStream constructor API doc). If you're using Python, it's StorageLevel.MEMORY_ONLY_SER .

The doc details a number of storage levels and what they mean, but fundamentally each is just an instance of the StorageLevel class. You can thus define your own, with a replication factor of your choosing (Spark caps it below 40).
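A custom level can be built with the StorageLevel factory, whose arguments are (useDisk, useMemory, useOffHeap, deserialized, replication). A sketch, assuming an RDD named myRDD already exists:

```scala
import org.apache.spark.storage.StorageLevel

// A memory-only, deserialized level keeping 3 replicas of each partition:
val threeCopies = StorageLevel(
  useDisk = false,
  useMemory = true,
  useOffHeap = false,
  deserialized = true,
  replication = 3)

// Persist the RDD with that custom level:
myRDD.persist(threeCopies)
```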

Note that of the various predefined storage levels, some keep a single copy of the RDD. In fact, that's true of all those whose names aren't suffixed with _2 (besides NONE, which persists nothing):

  • DISK_ONLY
  • MEMORY_ONLY
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_SER
  • OFF_HEAP

That's one copy per medium they employ, of course; if you want a single copy overall, you have to choose a single-medium storage level.
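To illustrate the distinction, assuming an RDD named rdd: the predefined _2 levels are simply the same levels with replication set to 2, and the factor is visible on the level object itself.

```scala
import org.apache.spark.storage.StorageLevel

// A single copy overall, on disk only:
rdd.persist(StorageLevel.DISK_ONLY)

// Two copies per partition (note the _2 suffix), in memory with disk spill:
// rdd.persist(StorageLevel.MEMORY_AND_DISK_2)

// The replication factor of a level can be inspected directly:
println(StorageLevel.DISK_ONLY.replication)          // 1
println(StorageLevel.MEMORY_AND_DISK_2.replication)  // 2
```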

As huitseeker said, unless you specifically ask Spark to persist an RDD with a StorageLevel that uses replication, it won't keep multiple copies of the RDD's partitions.

What Spark does do is keep a lineage of how each piece of data was computed, so that when/if a node fails, it only repeats the processing needed to recompute the lost RDD partitions. In my experience this mostly works, though on occasion it is faster to restart the job than to let it recover.
