
Checkpointing / persisting / shuffling does not seem to 'short circuit' the lineage of an RDD as detailed in the 'Learning Spark' book

In Learning Spark, I read the following:

In addition to pipelining, Spark's internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. Spark can “short-circuit” in this case and just begin computing based on the persisted RDD. A second case in which this truncation can happen is when an RDD is already materialized as a side effect of an earlier shuffle, even if it was not explicitly persist()ed. This is an under-the-hood optimization that takes advantage of the fact that Spark shuffle outputs are written to disk, and exploits the fact that many times portions of the RDD graph are recomputed.

So, I decided to try to see this in action with a simple program (below):

val pairs = spark.sparkContext.parallelize(List((1,2)))
val x   = pairs.groupByKey()
x.toDebugString  // before collect
x.collect()
x.toDebugString  // after collect

spark.sparkContext.setCheckpointDir("/tmp")
// try both checkpointing and persisting to disk to cut lineage
x.checkpoint()
x.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
x.collect()
x.toDebugString  // after checkpoint

I did not see what I expected after reading the above paragraph from the Spark book. I saw the exact same output from toDebugString each time I invoked the method, each time indicating two stages (where I would have expected only one stage after the checkpoint had truncated the lineage), like this:

scala>     x.toDebugString  // after collect
res5: String =
(8) ShuffledRDD[1] at groupByKey at <console>:25 []
 +-(8) ParallelCollectionRDD[0] at parallelize at <console>:23 []
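
For reference, a lineage that has actually been cut by a checkpoint looks different: the checkpointed data replaces the original parent RDDs in the debug string. A sketch of what that output typically looks like (the exact RDD ids and console line numbers will vary):

(8) ShuffledRDD[1] at groupByKey at <console>:25 []
 |  ReliableCheckpointRDD[2] at collect at <console>:26 []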

I am wondering if the key thing that I overlooked might be the word "may", as in "the scheduler MAY truncate the lineage". Is this truncation something that might happen with the same program I wrote above, under other circumstances? Or is the little program that I wrote not doing the right thing to force the lineage truncation? Thanks in advance for any insight you can provide!

I think that you should persist/checkpoint before you do the first collect. Given that code, what you got looks correct to me: when Spark runs the first collect, it does not yet know that it should persist or save anything.

Also, you may want to save the result of x.persist and then use that. I propose trying this:

val pairs = spark.sparkContext.parallelize(List((1, 2)))
val x = pairs.groupByKey()

// The checkpoint directory must be set before checkpoint() is called,
// or Spark throws "Checkpoint directory has not been set in the SparkContext".
spark.sparkContext.setCheckpointDir("/tmp")

// Mark the RDD for checkpointing and persistence *before* the first action,
// so the first job can materialize, persist, and checkpoint it.
x.checkpoint()
x.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)

// Also maybe do val xx = x.persist(...) and use xx later.

x.toDebugString  // before collect: full lineage
x.collect()      // first action: computes, persists, and checkpoints the RDD
x.toDebugString  // after collect: lineage should now be truncated
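
To confirm that the checkpoint actually took effect, you can also inspect the RDD directly; a minimal sketch (the exact paths and RDD ids will vary):

x.isCheckpointed     // should be true once the first action has run
x.getCheckpointFile  // Some(file:/tmp/...) pointing into the checkpoint directory
x.toDebugString      // should now show a ReliableCheckpointRDD in place of the ParallelCollectionRDD

Note that the shuffle-output reuse the book mentions does not show up in toDebugString at all; it typically appears as skipped stages in the Spark UI when a later job reuses the map output of an earlier shuffle.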
