I am trying to parallelize the reading of Kafka messages so that they can be processed in parallel. My Kafka topic has 10 partitions. I am creating 5 DStreams and applying the union method to operate on them as a single DStream. Here is the code I have tried so far:
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import scala.util.Random

def main(args: scala.Array[String]): Unit = {
  val properties = readProperties()
  val streamConf = new SparkConf().setMaster("local[2]").setAppName("KafkaStream")
  val ssc = new StreamingContext(streamConf, Seconds(1))
  // println("defaultParallelism: " + ssc.sparkContext.defaultParallelism)
  ssc.sparkContext.setLogLevel("WARN")
  val numPartitionsOfInputTopic = 5
  // Random suffix on a fixed prefix (mkString("consumer_group") would instead
  // join the random characters with that string as a separator)
  val group_id = "consumer_group_" + Random.alphanumeric.take(4).mkString
  val kafkaStream = {
    val kafkaParams = Map(
      "zookeeper.connect" -> properties.getProperty("zookeeper_connection_str"),
      "group.id" -> group_id,
      "zookeeper.connection.timeout.ms" -> "3000")
    val streams = (1 to numPartitionsOfInputTopic).map { _ =>
      KafkaUtils.createStream[scala.Array[Byte], String, DefaultDecoder, StringDecoder](
        ssc, kafkaParams, Map("kafka_topic" -> 1), StorageLevel.MEMORY_ONLY_SER).map(_._2)
    }
    val unifiedStream = ssc.union(streams)
    val sparkProcessingParallelism = 5
    unifiedStream.repartition(sparkProcessingParallelism)
  }
  kafkaStream.foreachRDD { x =>
    x.foreach { msg =>
      println("Message: " + msg)
      processMessage(msg)
    }
  }
  ssc.start()
  ssc.awaitTermination()
}
Upon execution, it is not receiving a single message, let alone processing any further. Am I missing something here? Please suggest changes if required. Thanks.
I highly recommend switching to the Direct Stream API. Why?
A Direct Stream by default sets the parallelism to the number of partitions you have in Kafka. Nothing more has to be done - just create a Direct Stream and do your job :)
If you create 5 receiver-based DStreams, you read in 5 threads by default: one non-Direct DStream = one thread. Each receiver also permanently occupies one core, and your master is local[2], so the 5 receivers leave Spark no threads to actually process batches - which is why you are not seeing any messages.
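For illustration, here is a minimal sketch of the Direct Stream approach using the spark-streaming-kafka 0.8 API (matching the API version in the question). The broker address "localhost:9092" and the topic name "kafka_topic" are placeholder assumptions; substitute your own values. Note that the direct API connects to the Kafka brokers directly rather than to ZooKeeper:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamExample {
  def main(args: Array[String]): Unit = {
    // No long-running receivers with the direct approach, so local[2]
    // is enough: all cores are available for batch processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaDirectStream")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.sparkContext.setLogLevel("WARN")

    // The direct API talks to the brokers, not ZooKeeper.
    val kafkaParams = Map(
      "metadata.broker.list" -> "localhost:9092",  // placeholder broker list
      "group.id" -> "consumer_group"
    )
    val topics = Set("kafka_topic")  // placeholder topic name

    // One RDD partition per Kafka partition: a 10-partition topic yields
    // 10 parallel tasks per batch, with no union or repartition needed.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics).map(_._2)

    stream.foreachRDD { rdd =>
      rdd.foreach { msg =>
        println("Message: " + msg)
        // processMessage(msg)  // plug your processing back in here
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

One design note: because each RDD partition maps one-to-one onto a Kafka partition, you only need `repartition` if your processing parallelism should deliberately differ from the topic's partition count.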