
Is .parallelize(...) a lazy operation in Apache Spark?

Is parallelize (and other load operations) executed only at the time a Spark action is executed, or immediately when it is encountered?

See def parallelize in the Spark source code.

Note the different consequences, for instance for .textFile(...): lazy evaluation would mean that, while possibly saving some memory initially, the text file has to be read every time an action is performed, and that a change in the text file would affect all actions after the change.

parallelize is executed lazily: see L726 of the code you cited, which states "@note Parallelize acts lazily."

Execution in Spark is only triggered once you call an action, e.g. collect or count.

Thus, in total, with Spark:

  1. A chain of transformations is set up by the user API (you), e.g. parallelize, map, reduce, ...
  2. Once an action is called, the chain of transformations is "put" into the Catalyst optimizer, gets optimized and is then executed (see the sketch after this list).
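To illustrate those two steps, here is a minimal, self-contained sketch; the application name, master setting and object name are illustrative assumptions, not part of the answer above:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyChainDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical local setup, for illustration only.
    val sc = new SparkContext(new SparkConf().setAppName("lazy-demo").setMaster("local[*]"))

    // Step 1: parallelize and map only record the lineage; no job runs yet.
    val numbers = sc.parallelize(1 to 1000000)
    val squares = numbers.map(x => x.toLong * x)

    // Step 2: the action (reduce) triggers execution of the whole chain.
    val total = squares.reduce(_ + _)
    println(total)

    sc.stop()
  }
}
```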

... (and other load operations)

parallelize is lazy (as already stated by Martin Senne and Chandan), the same as the standard data loading operations defined on SparkContext, like textFile.

DataFrameReader.load and related methods are in general only partially lazy. Depending on the context, they may require metadata access (JDBC sources, Cassandra) or even a full data scan (CSV with schema inference).
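A hedged sketch of that difference for CSV, assuming a local SparkSession and a hypothetical file /tmp/people.csv:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("reader-demo").master("local[*]").getOrCreate()

// With schema inference, Spark scans the CSV at read time to work out
// the column types -- before any action has been called.
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/people.csv")            // hypothetical path

// Supplying the schema up front avoids that eager scan; reading the rows
// themselves is still deferred until an action such as count().
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))

val deferred = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("/tmp/people.csv")            // hypothetical path
```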

Please note that here we have just defined the RDD; data is not loaded yet. This means that if you go to access the data in this RDD, it could fail. The computation to create the data in an RDD is only done when the data is referenced; for example, it is created by caching or writing out the RDD.

Cited from the link.
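For instance, a sketch assuming a SparkContext named sc (as in the spark-shell) and a deliberately non-existent path:

```scala
// Defining the RDD succeeds even though the path does not exist
// ("/no/such/file.txt" is a hypothetical path used only for illustration):
val missing = sc.textFile("/no/such/file.txt")   // no error here

// The failure only surfaces once an action forces the data to be read:
// missing.count()   // throws (e.g. an "Input path does not exist" error) at this point
```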

Not only parallelize(): all transformations are lazy.

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
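A short sketch of how the remembered lineage can be inspected, again assuming a SparkContext sc as in the spark-shell:

```scala
val words = sc.parallelize(Seq("a", "b", "c"))
val upper = words.map(_.toUpperCase)   // remembered in the lineage, not computed

// toDebugString prints the recorded lineage without running a job:
println(upper.toDebugString)

// Only an action actually computes the result and returns it to the driver:
println(upper.collect().mkString(","))
```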

Have a look at this article to learn about all the transformations in Scala.

Have a look at this documentation for more details.
