
Concurrent transformations on RDD in foreachRDD function of Spark DStream

In the following code, functions fn1 and fn2 appear to be applied to inRDD sequentially, judging by the Stages section of the Spark Web UI.

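 DStreamRDD1.foreachRDD(new VoidFunction<JavaRDD<String>>()
 {
     @Override
     public void call(JavaRDD<String> inRDD)
     {
         inRDD.foreach(fn1);  // first action: blocks until its job completes
         inRDD.foreach(fn2);  // second action: starts only after the first returns
     }
 });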

How is this different when the streaming job is run as shown below? Are the following functions run in parallel on the input DStream?

DStreamRDD1.foreachRDD(fn1)
DStreamRDD2.foreachRDD(fn2)

Both foreach on an RDD and foreachRDD on a DStream run sequentially because they are output operations (actions, in RDD terms): they force the execution graph to be materialized. This is not the case for ordinary lazy transformations in Spark, which can run in parallel when the execution graph diverges into multiple separate stages.
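The difference between a lazy transformation and an output operation can be seen without a cluster. The sketch below is only an analogy written with plain java.util.stream, not Spark code: filter stands in for a lazy transformation and forEach for an output operation. The class name LazyVsOutput and the log list are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Analogy only: java.util.stream, not Spark. Names here are hypothetical.
class LazyVsOutput {
    // Records the order in which work actually happens.
    static final List<String> log = new ArrayList<>();

    static void run() {
        log.clear();
        // Lazy step: declaring the filter does not evaluate anything yet.
        Stream<String> filtered = Stream.of("hello", "world", "hi")
                .filter(s -> {
                    log.add("filter " + s);
                    return s.contains("h");
                });
        log.add("declared");
        // Output step: the terminal operation forces evaluation
        // and returns only when every element has been processed.
        filtered.forEach(s -> log.add("output " + s));
        log.add("done");
    }

    public static void main(String[] args) {
        run();
        log.forEach(System.out::println);
    }
}
```

In the recorded log, "declared" comes before any "filter" entry, which shows that the filter body runs only once the output step asks for data.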

For example:

import org.apache.spark.streaming.dstream.DStream

val dStream: DStream[String] = ???  // input stream, left unspecified
val first = dStream.filter(x => x.contains("h"))
val second = dStream.filter(x => !x.contains("h"))

first.print()
second.print()

The first part (the two filter transformations) need not execute sequentially when the cluster has sufficient resources to run the underlying stages in parallel. Calling print, which is again an output operation, then causes the two results to be printed one after the other.
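The ordering of the two output steps can be mimicked in plain Java. This is again an analogy and not Spark code: two branches are derived from the same input with complementary filters, and each output step is a blocking call, so the calling thread finishes the first before it starts the second. The names TwoBranches, emit, and printed, as well as the sample batch, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Analogy only: plain Java collections, not Spark. Names here are hypothetical.
class TwoBranches {
    // Collects what would be printed, in the order it is produced.
    static final List<String> printed = new ArrayList<>();

    // Stand-in for an output operation: evaluates one branch and records each element.
    static void emit(String label, List<String> input, Predicate<String> keep) {
        input.stream()
             .filter(keep)
             .forEach(s -> printed.add(label + ":" + s));
    }

    static void run() {
        printed.clear();
        List<String> batch = Arrays.asList("hello", "world", "hi", "spark");
        Predicate<String> hasH = s -> s.contains("h");
        emit("first", batch, hasH);            // blocks until the first branch is done
        emit("second", batch, hasH.negate());  // starts only after the first returns
    }

    public static void main(String[] args) {
        run();
        printed.forEach(System.out::println);
    }
}
```

All entries labelled first appear before any entry labelled second, because both calls are issued from the same thread, one after the other.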
