
Is .parallelize(…) a lazy operation in Apache Spark?

Original post 2015-12-27 11:45:15 7 4 scala / apache-spark

Is parallelize (and other load operations) executed only at the time a Spark action is executed or immediately when it is encountered?

See def parallelize in the Spark source code.

Note that the answer has consequences, for instance for .textFile(...): lazy evaluation would mean that, while possibly saving some memory initially, the text file has to be read every time an action is performed, and that a change to the text file would affect all actions performed after the change.

4 answers

parallelize is executed lazily: see L726 of your cited code stating "@note Parallelize acts lazily."

Execution in Spark is only triggered once you call an action, e.g. collect or count.

Thus in total with Spark:

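A chain of transformations is set up through the user API (by you), e.g. parallelize, map, filter, ...
Once an action (e.g. reduce, collect or count) is called, the chain of transformations is handed to the scheduler, split into stages and then executed. For DataFrame/Dataset queries the plan is first optimized by the Catalyst optimizer; plain RDD chains do not go through Catalyst.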
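The two steps above can be observed with a small sketch. It assumes Spark 2.x or later (for sc.longAccumulator) running in local mode; the object name, application name and accumulator name are made up for illustration. An accumulator counts how many elements the map function has processed, read once before and once after the first action.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyParallelizeSketch {
  // Returns the accumulator value before and after the first action.
  def run(): (Long, Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-parallelize-sketch")
    val sc = new SparkContext(conf)
    try {
      // Counts the elements that the map function below has processed.
      val touched = sc.longAccumulator("touched")

      // Only defines the RDD and its lineage; no task is scheduled here.
      val rdd = sc.parallelize(1 to 3).map { x => touched.add(1); x * 2 }
      val before = touched.value.longValue

      // count() is an action, so this line triggers the job.
      rdd.count()
      val after = touched.value.longValue

      (before, after)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = run()
    println(s"processed before action: $before, after action: $after")
  }
}
```

In a local run without task retries, the first value should be 0 and the second 3. Accumulator updates made inside transformations can be applied more than once if a task is re-executed, so this is a demonstration aid rather than a reliable counter.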

... (and other load operations)

parallelize is lazy (as already stated by Martin Senne and Chandan), the same as the standard data loading operations defined on SparkContext, like textFile.

DataFrameReader.load and related methods are, in general, only partially lazy. Depending on the context, they may require metadata access (JDBC sources, Cassandra) or even a full data scan (CSV with schema inference).
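As a rough illustration of the CSV case, the sketch below writes a tiny temporary file and reads it with inferSchema enabled. It assumes Spark 2.x or later with the spark-sql module on the classpath; the object name, file contents and column names are invented for the example. The inferred column type is available as soon as csv(...) returns, which is only possible because Spark has already looked at the data.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object PartiallyLazyReaderSketch {
  // Returns the inferred type of the "age" column and the row count.
  def run(): (String, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("partially-lazy-reader-sketch")
      .getOrCreate()
    try {
      // Hypothetical sample data, written to a temporary file.
      val path = Files.createTempFile("people", ".csv")
      Files.write(path, "name,age\nann,31\nbob,27\n".getBytes("UTF-8"))

      // With inferSchema, Spark has to read the file here to work out
      // the column types, before any action is called on df.
      val df = spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(path.toString)

      // The schema is already known; no action on df has run yet.
      val ageType = df.schema("age").dataType.simpleString

      // count() is the first action on the DataFrame itself.
      (ageType, df.count())
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Supplying an explicit schema via .schema(...) instead of inferSchema is the usual way to avoid this extra pass over the data.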

Please note that here we have just defined RDD, data is not loaded still. This means that if you go to access the data in this RDD it could fail. The computation to create the data in an RDD is only done when the data is referenced; for example, it is created by caching or writing out the RDD.

(quoted from the linked article)
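The "it could fail" remark can be sketched with textFile and a path that does not exist. The path below is deliberately hypothetical, and the object and application names are made up; the sketch assumes a local Spark setup using the default local file system. Defining the RDD succeeds, and the failure only surfaces when an action forces Spark to look for the input.

```scala
import scala.util.Try
import org.apache.spark.{SparkConf, SparkContext}

object LazyTextFileSketch {
  // Returns (defining the RDD succeeded, running an action failed).
  def run(): (Boolean, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-textfile-sketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical path that is assumed not to exist.
      val missing = "/tmp/hypothetical-missing-" + java.util.UUID.randomUUID() + ".txt"

      // Only defines the RDD; the file is not opened here.
      val defined = Try(sc.textFile(missing).map(_.length))

      // count() is an action: Spark now tries to list and read the input.
      val counted = defined.flatMap(rdd => Try(rdd.count()))

      (defined.isSuccess, counted.isFailure)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```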

Not only parallelize(): all transformations are lazy.

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
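One consequence of this, which also bears on the concern raised in the question about repeated reads, is that each action recomputes the chain unless the RDD is persisted. The sketch below is a minimal illustration under the same assumptions as before (Spark 2.x or later, local mode, no task retries, enough memory for the cached partitions); all names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RecomputeVsCacheSketch {
  // Returns how often the map function ran without and with cache().
  def run(): (Long, Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("recompute-vs-cache-sketch")
    val sc = new SparkContext(conf)
    try {
      val plainCalls = sc.longAccumulator("plainCalls")
      val cachedCalls = sc.longAccumulator("cachedCalls")
      val data = sc.parallelize(1 to 3)

      // No cache: every action re-runs the map over all elements.
      val plain = data.map { x => plainCalls.add(1); x }
      plain.count()
      plain.count()

      // cache() is itself lazy: the first action fills the cache,
      // the second action reads the cached partitions.
      val cached = data.map { x => cachedCalls.add(1); x }.cache()
      cached.count()
      cached.count()

      (plainCalls.value.longValue, cachedCalls.value.longValue)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Under those assumptions the uncached counter should end at 6 (two passes over three elements) and the cached one at 3.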

Have a look at this article to know all transformations in Scala.

Have a look at this documentation for more details.




solution1
3 ACCEPTED 2015-12-27 13:30:35

solution2
2 2015-12-27 13:55:49

solution3
1 2015-12-27 13:29:59

solution4
1 2015-12-28 08:14:09
