Is there a way to change the replication factor of RDDs in Spark?

From what I understand, there are multiple copies of an RDD's data in the cluster, so that the program can recover if a node fails. However, in cases where the chance of failure is negligible, keeping multiple copies of the data in RDDs would be costly memory-wise. So my question is: is there a parameter in Spark that can be used to reduce the replication factor of RDDs?

First, note that Spark does not automatically cache all your RDDs, simply because applications may create many RDDs, and not all of them will be reused. You have to call .persist() or .cache() on them.

You can set the storage level when persisting an RDD, e.g. myRDD.persist(StorageLevel.MEMORY_AND_DISK). .cache() is a shorthand for .persist(StorageLevel.MEMORY_ONLY).
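For instance, a minimal sketch in Scala, assuming an existing SparkContext named sc (as you would have inside spark-shell):

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc`.
val myRDD = sc.parallelize(1 to 1000)

// Keep a single deserialized copy in memory -- equivalent to myRDD.cache():
myRDD.persist(StorageLevel.MEMORY_ONLY)

// Alternatively, spill partitions to disk when they don't fit in memory
// (an RDD's storage level can only be set once, hence commented out here):
// myRDD.persist(StorageLevel.MEMORY_AND_DISK)
```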

The default storage level for persist is indeed StorageLevel.MEMORY_ONLY for an RDD in Java or Scala – but usually differs if you are creating a DStream (refer to your DStream constructor API doc). If you're using Python, it's StorageLevel.MEMORY_ONLY_SER .

The doc details a number of storage levels and what they mean, but fundamentally each is just an instance of the StorageLevel class. You can thus define your own, with a replication factor of your choosing (Spark caps it below 40).
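A custom level can be built with the StorageLevel factory, whose arguments are (useDisk, useMemory, useOffHeap, deserialized, replication). A sketch, assuming an RDD named myRDD already exists:

```scala
import org.apache.spark.storage.StorageLevel

// A memory-only, deserialized level keeping 3 replicas of each partition:
val threeCopies = StorageLevel(
  useDisk = false,
  useMemory = true,
  useOffHeap = false,
  deserialized = true,
  replication = 3)

// Persist the RDD with that custom level:
myRDD.persist(threeCopies)
```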

Note that of the various predefined storage levels, some keep a single copy of the RDD. In fact, that's true of all those whose names aren't suffixed with _2 (besides NONE, which persists nothing):

  • DISK_ONLY
  • MEMORY_ONLY
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_SER
  • OFF_HEAP

That's one copy per medium they employ, of course; if you want a single copy overall, you have to choose a single-medium storage level.
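To illustrate the distinction, assuming an RDD named rdd: the predefined _2 levels are simply the same levels with replication set to 2, and the factor is visible on the level object itself.

```scala
import org.apache.spark.storage.StorageLevel

// A single copy overall, on disk only:
rdd.persist(StorageLevel.DISK_ONLY)

// Two copies per partition (note the _2 suffix), in memory with disk spill:
// rdd.persist(StorageLevel.MEMORY_AND_DISK_2)

// The replication factor of a level can be inspected directly:
println(StorageLevel.DISK_ONLY.replication)          // 1
println(StorageLevel.MEMORY_AND_DISK_2.replication)  // 2
```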

As huitseeker said, unless you specifically ask Spark to persist an RDD with a StorageLevel that uses replication, it won't keep multiple copies of the RDD's partitions.

What Spark does do is keep a lineage of how each piece of data was computed, so that when/if a node fails, it only repeats the processing needed to recompute the lost RDD partitions. In my experience this mostly works, though on occasion it is faster to restart the job than to let it recover.
