
Number of Parallel Tasks in Spark Streaming and Kafka Integration

I am very new to Spark Streaming. I have some basic doubts. Can someone please help me clarify the following:

  1. My message size is standard: 1 KB per message.

  2. The topic has 30 partitions, and I am using the DStream approach to consume messages from Kafka.

  3. Number of cores given to the Spark job:

     spark.cores.max=6, spark.executor.cores=2

  4. As I understand it, the number of Kafka partitions = the number of RDD partitions. In the DStream approach (a fuller, runnable sketch follows this list):

     dstream.foreachRDD { rdd =>
       rdd.foreachPartition { records => /* process one partition */ }
     }

     Question: will this foreachPartition loop execute 30 times, since there are 30 Kafka partitions?

  5. Also, since I have given the job 6 cores, how many partitions will be consumed from Kafka in parallel?

     Question: is it 6 partitions at a time, or 30/6 = 5 partitions at a time? Can someone please give a little detail on how exactly this works in the DStream approach?
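For context, here is a minimal, runnable sketch of the setup described above, using the spark-streaming-kafka-0-10 direct stream API. The broker address, topic name, group id, and batch interval are placeholders, not values from the question:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object KafkaParallelismSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("kafka-parallelism-sketch")
          .set("spark.cores.max", "6")       // total cores for the job
          .set("spark.executor.cores", "2")  // cores per executor
        val ssc = new StreamingContext(conf, Seconds(5))

        // Placeholder connection settings -- replace with your own.
        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "broker:9092",
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "sketch-group",
          "auto.offset.reset"  -> "latest"
        )

        // One direct stream over a 30-partition topic:
        // each micro-batch RDD will have 30 partitions.
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

        stream.foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            records.foreach(r => ()) // process each ~1 KB message here
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }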

"Is it 6 partitions at a time or 30/6 =5 partitions at a time?" “是一次6个分区还是30/6=一次5个分区?”

As you said already, each RDD produced by the direct stream will have the same number of partitions as the Kafka topic.

On each micro-batch Spark will create 30 tasks, one to read each partition. As you have set the maximum number of cores to 6, the job is able to read 6 partitions in parallel. As soon as one of those tasks finishes, the next partition can be consumed.
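For reference, those limits could also be set at submit time, as in this hypothetical spark-submit invocation (the master URL and jar name are placeholders). With 6 cores in total, the 30 read tasks of each micro-batch run in roughly 5 waves of 6:

    spark-submit \
      --master spark://master:7077 \
      --conf spark.cores.max=6 \
      --conf spark.executor.cores=2 \
      my-streaming-job.jar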

Remember, even if you have no new data in one of the partitions, the resulting RDD still gets 30 partitions, so yes, the foreachPartition loop will iterate 30 times within each micro-batch.
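One way to see this is to log the partition count and the offset ranges per batch. This is a sketch, assuming stream is the direct stream from the earlier snippet; HasOffsetRanges comes from spark-streaming-kafka-0-10 and the cast must be done on the RDD before any transformation:

    import org.apache.spark.streaming.kafka010.HasOffsetRanges

    stream.foreachRDD { rdd =>
      // Matches the topic: always 30, even when some partitions are empty.
      println(s"partitions in this batch: ${rdd.getNumPartitions}")
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val empty = ranges.count(r => r.fromOffset == r.untilOffset)
      println(s"partitions with no new records: $empty")
    }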
