SparkContext parallelize lazy behavior - unexplained

According to the comments in the Spark source code, SparkContext.scala has:

  /** Distribute a local Scala collection to form an RDD.
   *
   * @note Parallelize acts lazily. If `seq` is a mutable collection and is altered after the call
   * to parallelize and before the first action on the RDD, the resultant RDD will reflect the
   * modified collection. Pass a copy of the argument to avoid this.
   * @note avoid using `parallelize(Seq())` to create an empty `RDD`. Consider `emptyRDD` for an
   * RDD with no partitions, or `parallelize(Seq[T]())` for an RDD of `T` with empty partitions.
   */

So I thought I would run a simple test.

scala> var c = List("a0", "b0", "c0", "d0", "e0", "f0", "g0")
c: List[String] = List(a0, b0, c0, d0, e0, f0, g0)

scala> var crdd = sc.parallelize(c)
crdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:26

scala> c = List("x1", "y1")
c: List[String] = List(x1, y1)

scala> crdd.foreach(println)
[Stage 0:>                                                          (0 + 0) / 8]d0
a0
b0
e0
f0
g0
c0

scala>

Given the lazy behavior of parallelize, I expected crdd.foreach(println) to output "x1" and "y1".

What am I doing wrong?

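You did not modify c at all. You reassigned it to a new list.

Besides that point,

"If `seq` is a mutable collection"

Scala's List is not a mutable collection.

"and is altered after the call to parallelize and before the first action on the RDD"

Well, see, you did not really alter the list.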
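The difference between rebinding a `var` and mutating a collection in place can be shown without Spark. The sketch below is only an illustration: `DeferredReader` is a hypothetical stand-in for an RDD (it is not part of Spark) that keeps a reference to the collection and reads it only when `collect()` is called.

```scala
import scala.collection.mutable.ListBuffer

object RebindVsMutate {
  // Hypothetical stand-in for an RDD (not a Spark class): it stores a
  // reference to the collection and reads it only when collect() is called.
  final class DeferredReader[T](data: scala.collection.Seq[T]) {
    def collect(): List[T] = data.toList
  }

  // Rebinding a var: the reader still holds the original List object.
  def rebind(): List[String] = {
    var c = List("a0", "b0")
    val reader = new DeferredReader(c)
    c = List("x1", "y1") // c now names a different List; the old one is untouched
    reader.collect()
  }

  // In-place mutation: the reader and `buf` share the same object.
  def mutate(): List[Int] = {
    val buf = ListBuffer(1, 2, 3)
    val reader = new DeferredReader(buf)
    buf.append(4) // changes the object the reader refers to
    reader.collect()
  }

  def main(args: Array[String]): Unit = {
    println(rebind()) // List(a0, b0)
    println(mutate()) // List(1, 2, 3, 4)
  }
}
```

Only the second case matches the situation the `@note` describes: the same object is altered between creating the lazy reader and reading from it.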


Here is a proper example of the documented behavior.

scala> val c = scala.collection.mutable.ListBuffer(1, 2, 3)
c: scala.collection.mutable.ListBuffer[Int] = ListBuffer(1, 2, 3)

scala> val cRDD = sc.parallelize(c)
cRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:29

scala> c.append(4)

scala> c
res7: scala.collection.mutable.ListBuffer[Int] = ListBuffer(1, 2, 3, 4)

scala> cRDD.collect()
res8: Array[Int] = Array(1, 2, 3, 4)
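The same note also says to pass a copy of the argument to avoid this. A minimal sketch of that advice follows, under some assumptions: a Spark build for Scala 2.12 or earlier (where `parallelize` accepts a `ListBuffer` directly), a local master, and the object name `ParallelizeCopy` and helper `compare`, which are made up for illustration. `c.toList` is used here as one way to take an immutable copy.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

object ParallelizeCopy {
  // Returns (elements read through the shared buffer, elements read through the copy).
  def compare(sc: SparkContext): (List[Int], List[Int]) = {
    val c = ListBuffer(1, 2, 3)
    val shared = sc.parallelize(c)        // keeps a reference to the mutable buffer
    val copied = sc.parallelize(c.toList) // immutable copy taken at this point
    c.append(4)                           // mutate before any action has run
    (shared.collect().toList, copied.collect().toList)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, for demonstration only.
    val conf = new SparkConf().setAppName("parallelize-copy").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (shared, copied) = compare(sc)
      println(shared) // List(1, 2, 3, 4)
      println(copied) // List(1, 2, 3)
    } finally {
      sc.stop()
    }
  }
}
```

The RDD built from the copy is unaffected by the later `append`, because the immutable `List` it refers to cannot change after the call.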
