
Do skipped stages have any performance impact on a Spark job?

I am running a Spark Structured Streaming job that involves creating an empty dataframe and updating it with each micro-batch, as shown below. With every micro-batch execution, the number of stages increases by 4. To avoid recomputation, I persist the updated staticDF into memory after each update inside the loop. This helps in skipping the additional stages that get created with every new micro-batch.

My questions:

1) Even though the total number of completed stages stays the same because the additional stages are always skipped, can this cause a performance issue, given that at some point there could be millions of skipped stages?
2) What happens when some part, or all, of the cached RDD is not available (node/executor failure)? The Spark documentation says it does not materialize all the data received from the micro-batches so far, so does that mean Spark would need to read all events from Kafka again to regenerate staticDF?

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructType}

// one-time creation of an empty static (not streaming) dataframe
val staticDF_schema = new StructType()
  .add("product_id", LongType)
  .add("created_at", LongType)
var staticDF = sparkSession
  .createDataFrame(sparkSession.sparkContext.emptyRDD[Row], staticDF_schema)

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, max, row_number}
import org.apache.spark.sql.streaming.Trigger

// Note : streamingDF was created from a Kafka source
streamingDF.writeStream
  .trigger(Trigger.ProcessingTime(10000L))
  .foreachBatch { (micro_batch_DF: DataFrame, batchId: Long) =>

    // fetching max created_at for each product_id in the current micro-batch
    val staging_df = micro_batch_DF.groupBy("product_id")
      .agg(max("created_at").alias("created_at"))

    // updating staticDF with the current micro-batch, keeping only the latest
    // created_at per product_id, and caching the result so the next batch can
    // reuse it instead of recomputing the whole lineage
    staticDF = staticDF.unionByName(staging_df)
    staticDF = staticDF
      .withColumn("rnk",
        row_number().over(Window.partitionBy("product_id").orderBy(desc("created_at")))
      )
      .filter("rnk = 1")
      .drop("rnk")
      .cache()
  }
  .start()


Even though skipped stages don't need any computation, my job started failing after a certain number of batches. This was because the DAG grew with every batch execution, eventually becoming unmanageable and throwing a stack overflow exception.

To avoid this, I had to break the Spark lineage so that the number of stages does not increase with every run (even if they are skipped).
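One common way to break the lineage in this kind of loop is to checkpoint the updated DataFrame instead of only caching it; checkpoint() materializes the data and truncates the logical plan. The post does not show which technique was actually used, so the sketch below is only an illustration under that assumption (the checkpoint directory path is made up):

// set once, before the streaming query starts; the path is illustrative
sparkSession.sparkContext.setCheckpointDir("hdfs:///tmp/staticDF_checkpoints")

// inside foreachBatch, replacing the plain cache() at the end of the update:
staticDF = staticDF
  .withColumn("rnk",
    row_number().over(Window.partitionBy("product_id").orderBy(desc("created_at")))
  )
  .filter("rnk = 1")
  .drop("rnk")
  .checkpoint() // eagerly writes the result to the checkpoint dir and cuts the
                // lineage, so the stage count no longer grows with every batch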
