
Processing RDDs in a DStream in parallel

I came across the following code, which processes messages in Spark Streaming:

val listRDD = ssc.socketTextStream(host, port)
listRDD.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    // Should I start a separate thread for each RDD and/or Partition?
    partition.foreach(message => {
      Processor.processMessage(message)
    })
  })
})

This works for me, but I am not sure it is the best way. I understand that a DStream consists of one or more RDDs, but this code processes the RDDs sequentially, one after the other, right? Is there a better way (a method or function) to process all the RDDs in the DStream in parallel? Should I start a separate thread for each RDD and/or partition? Have I misunderstood how this code works under Spark?

Somehow I think this code is not taking advantage of the parallelism in Spark.

Streams are split into small RDDs for convenience and efficiency (see micro-batching). But you really don't need to break every RDD into partitions yourself, or even break the stream into RDDs.

It all depends on what Processor.processMessage really is. If it is a single transformation function, you can just do listRDD.map(Processor.processMessage) and you get a stream of the results of processing each message, computed in parallel with no need for you to do much else.
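To make the map suggestion concrete, here is a minimal sketch. The body of Processor.processMessage is a made-up stand-in (the question never shows it), and the host, port, master setting and batch interval are assumptions chosen only so the example is complete.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical stand-in for the Processor in the question:
// a pure, stateless function from a message to a result.
object Processor {
  def processMessage(message: String): String = message.trim.toUpperCase
}

object MapExample {
  def main(args: Array[String]): Unit = {
    // Assumed settings: local mode with 2 threads (the socket receiver
    // occupies one of them), 1-second batches, made-up host and port.
    val conf = new SparkConf().setMaster("local[2]").setAppName("MapExample")
    val ssc = new StreamingContext(conf, Seconds(1))

    val messages = ssc.socketTextStream("localhost", 9999)

    // map is a transformation: Spark applies the function to the
    // partitions of each batch RDD on the executors, with no
    // hand-written threads.
    val results = messages.map(Processor.processMessage)

    // An output operation is needed, otherwise nothing is executed.
    results.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

This only works cleanly if processMessage has no side effects that depend on ordering or shared state; if it writes to an external system, the foreachRDD/foreachPartition pattern from the question is still the usual choice.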

If Processor is a mutable object that holds state (say, counting the number of messages), then things are more complicated: you will need to define many such objects to account for parallelism, and you will also need to merge their results later on.
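As a rough illustration of the "many objects, then merge" idea, the sketch below keeps one local counter per partition and merges the partial counts with reduce. The names Counting and countPartition, and the host, port and batch interval, are hypothetical. For plain counting, the built-in DStream count() would give the same per-batch number; the point here is only the pattern.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Counting {
  // Hypothetical helper: the "state" lives in a local variable, so each
  // partition gets its own independent counter.
  def countPartition(partition: Iterator[String]): Long = {
    var local = 0L
    partition.foreach(_ => local += 1)
    local
  }

  def main(args: Array[String]): Unit = {
    // Assumed settings, as in any small local test setup.
    val conf = new SparkConf().setMaster("local[2]").setAppName("Counting")
    val ssc = new StreamingContext(conf, Seconds(1))
    val messages = ssc.socketTextStream("localhost", 9999)

    val perBatchCounts = messages
      // one partial count per partition, computed in parallel
      .mapPartitions(partition => Iterator.single(countPartition(partition)))
      // merge the partial counts of each batch into a single number
      .reduce(_ + _)

    perBatchCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

This gives a count per batch. A running total across batches would need one of Spark Streaming's stateful operations, which is beyond what this sketch shows.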
