Is .parallelize(…) a lazy operation in Apache Spark?

Is parallelize (and other load operations) executed only at the time a Spark action is executed or immediately when it is encountered?

See def parallelize in the Spark source code.

Note the different consequences, for instance for .textFile(...): lazy evaluation would mean that, while possibly saving some memory initially, the text file has to be read every time an action is performed, and that a change to the text file would affect all actions performed after the change.
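The consequence described here can be sketched with a toy model in plain Python. This is not Spark's API: LazyTextFile and demo_reread are hypothetical names made up for illustration, and the model assumes an uncached dataset whose source is re-read on every action.

```python
import os
import tempfile


class LazyTextFile:
    """Toy stand-in for an uncached, lazily evaluated text-file dataset (not Spark's API)."""

    def __init__(self, path):
        # Only the path is remembered; nothing is read here.
        self.path = path

    def collect(self):
        # "Action": the file is opened and read each time this is called.
        with open(self.path, encoding="utf-8") as f:
            return f.read().splitlines()


def demo_reread():
    """Run two actions with a file change in between and return both results."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write("a\nb\n")
        lines = LazyTextFile(path)   # definition only, no read yet
        first = lines.collect()      # first action reads the file
        with open(path, "a", encoding="utf-8") as f:
            f.write("c\n")           # the file changes between actions
        second = lines.collect()     # second action reads it again
        return first, second
    finally:
        os.remove(path)
```

In this model the second action sees the appended line, because nothing was materialized when the dataset was defined.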

parallelize is executed lazily: see L726 of your cited code stating "@note Parallelize acts lazily."

Execution in Spark is only triggered once you call an action, e.g. collect or count.

Thus in total with Spark:

  1. A chain of transformations is set up through the user API (by you), e.g. parallelize, map, filter, ...
  2. Once an action (e.g. collect, count or reduce) is called, the chain of transformations is handed to Spark's scheduler and executed. For DataFrame/Dataset queries, the plan first goes through the Catalyst optimizer.
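The two steps above can be illustrated with a small pure-Python sketch. It is a toy model, not Spark: LazyDataset is a hypothetical class that merely records its transformations and runs them when collect is called, with no optimization or scheduling.

```python
class LazyDataset:
    """Toy model of a lazily evaluated dataset (illustrative, not Spark's RDD)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded chain applied to the data.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records every element that double() actually processes


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset([1, 2, 3]).map(double).filter(lambda x: x > 2)
calls_before_action = list(calls)  # snapshot taken before any action runs
result = ds.collect()              # the action triggers the whole chain
```

Here calls_before_action stays empty, which shows that defining the chain did no work; double only runs once collect is called.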

... (and other load operations)

parallelize is lazy (as already stated by Martin Senne and Chandan), as are the standard data loading operations defined on SparkContext, such as textFile.

DataFrameReader.load and related methods are, in general, only partially lazy. Depending on the context, they may require metadata access (JDBC sources, Cassandra) or even a full data scan (CSV with schema inference).

Please note that here we have just defined RDD, data is not loaded still. This means that if you go to access the data in this RDD it could fail. The computation to create the data in an RDD is only done when the data is referenced; for example, it is created by caching or writing out the RDD.

cited from the link
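A rough way to picture the point about deferred failure is the toy model below. It is plain Python rather than Spark, and LazyLines and demo_deferred_failure are hypothetical names: defining the dataset on a missing file succeeds, and the error only surfaces when an action touches the data.

```python
import os
import tempfile


class LazyLines:
    """Toy lazily evaluated line dataset (illustrative, not Spark's API)."""

    def __init__(self, path):
        # Definition only: the path is not checked or opened here.
        self.path = path

    def count(self):
        # "Action": the file is opened here, so a bad path fails here.
        with open(self.path, encoding="utf-8") as f:
            return sum(1 for _ in f)


def demo_deferred_failure():
    """Define a dataset on a missing file and report where the failure occurs."""
    with tempfile.TemporaryDirectory() as tmp:
        missing = os.path.join(tmp, "no-such-file.txt")
        dataset = LazyLines(missing)  # succeeds although the file is absent
        try:
            dataset.count()
        except FileNotFoundError:
            return "failed at action time"
        return "ok"
```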

Not only parallelize(): all transformations are lazy.

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.

Have a look at this article to know all transformations in Scala.

Have a look at this documentation for more details.
