
Does spark-streaming run multiple foreach in parallel

In this case:

val dStream : Stream[_] = 
dStream.foreachRDD(a => ... )
dStream.foreachRDD(b => ... )

Do the foreach methods:

  1. run in parallel?
  2. run in sequence, but in no specific order?
  3. run foreachRDD( a => ) before foreachRDD( b => )?

I want to know because I want to commit the Kafka offset after a database insert. (And the DB connector only provides a "foreach" insert.)

val dStream : Stream[_] = ...().cache()
dStream.toDb // consume the stream
dStream.foreachRDD(b => /* commit offset */ ) // consume the stream, but only after the db insert

In the Spark UI it looks like there is an ordering, but I'm not sure it's reliable.

Edit: if foreachRDD( a => ) fails, is foreachRDD( b => ) still executed?

DStream.foreach has been deprecated since Spark 0.9.0. You want the equivalent DStream.foreachRDD to begin with.

Stages in the Spark DAG are executed sequentially, as one transformation's output is usually also the input for the next transformation in the graph, but this isn't the case in your example.

What happens is that internally the RDD is divided into partitions. Each partition is run on a different worker that is available to the cluster manager. In your example, DStream.foreach(a => ...) will execute before DStream.foreach(b => ...), but the execution within each foreach runs in parallel with respect to the partitions of the internal RDD being iterated.
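
A minimal sketch of what that looks like in practice (assuming dStream is the cached DStream from the question, here taken to be a DStream[String]; the println calls are just placeholders for real work):

// Output operation "a": runs once per batch; the work inside runs per partition, in parallel across executors
dStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    partition.foreach(record => println(s"a saw: $record"))
  }
}

// Output operation "b": registered after "a" for the same batch
dStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    partition.foreach(record => println(s"b saw: $record"))
  }
}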

I want to know because I want to commit the Kafka offset after a database insert.

DStream.foreachRDD is an output operation, meaning it will cause Spark to materialize the graph and begin execution. You can safely assume that the insertion into the database will end prior to executing your second foreach, but keep in mind that your first foreach will be updating your database in parallel, for each partition of the RDD.
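
As a hedged sketch of what that per-partition parallelism usually looks like (createConnection and insertRecord are hypothetical helpers standing in for whatever DB connector is used; each partition opens its own connection on the executor):

dStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // one connection per partition, created on the executor rather than on the driver
    val connection = createConnection() // hypothetical helper
    partition.foreach(record => insertRecord(connection, record)) // hypothetical helper
    connection.close()
  }
}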

Multiple DStream.foreachRDD calls are not guaranteed to execute sequentially, at least up to spark-streaming 2.4.0. Look at this code in the JobScheduler class:

class JobScheduler(val ssc: StreamingContext) extends Logging {

  // Use of ConcurrentHashMap.keySet later causes an odd runtime problem due to Java 7/8 diff
  // https://gist.github.com/AlainODea/1375759b8720a3f9f094
  private val jobSets: java.util.Map[Time, JobSet] = new ConcurrentHashMap[Time, JobSet]
  private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
  private val jobExecutor =
    ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")

The JobExecutor is a thread pool, and if "spark.streaming.concurrentJobs" is set to a number greater than 1, jobs can execute in parallel when enough Spark executors are available. So make sure your settings are correct to get the behavior you need.
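
If the goal is strictly "commit the Kafka offset only after the database insert", one way to avoid depending on job ordering at all is to do both inside a single foreachRDD. A hedged sketch, assuming a direct stream (stream) from the spark-streaming-kafka-0-10 integration and a hypothetical writeToDb helper:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // grab the offset ranges while the RDD is still the original Kafka RDD
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // foreachPartition runs in parallel across partitions, but this call
  // does not return until the whole write job for the batch has finished
  rdd.foreachPartition { partition =>
    partition.foreach(record => writeToDb(record)) // hypothetical helper
  }

  // only reached after the db insert for this batch has completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}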
