
How to consume from a different Kafka topic in each batch of a Spark Streaming job?

I am pretty sure that there is no simple way of doing this, but here is my use case:

I have a Spark Streaming job (version 2.1.0) with a 5-second duration for each micro-batch.

My goal is to consume data from a different topic at every micro-batch interval, out of a total of 250 Kafka topics. You can take the code below as a simple example:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Duration, StreamingContext}
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val groupId: String = "first_group"
val kafka_servers: String = "datanode1:9092,datanode2:9092,datanode3:9092"

val ss: SparkSession = SparkSession.builder()
  .config("spark.streaming.unpersist", "true")
  .appName("ConsumerStream_test")
  .getOrCreate()
val ssc: StreamingContext = new StreamingContext(ss.sparkContext, Duration(5000))

val kafka_parameters: Map[String, Object] = Map(
  "bootstrap.servers"     -> kafka_servers,
  "key.deserializer"      -> classOf[StringDeserializer],
  "value.deserializer"    -> classOf[ByteArrayDeserializer],
  "heartbeat.interval.ms" -> (1000: Integer),
  "max.poll.interval.ms"  -> (100: Integer),
  "enable.auto.commit"    -> (false: java.lang.Boolean),
  "auto.offset.reset"     -> "earliest",
  //"connections.max.idle.ms" -> (5000: Integer),
  "group.id"              -> groupId
)

// Pick one of the 250 topics at random -- this is evaluated only once, when the stream is defined.
val r = scala.util.Random
val kafka_list_one_topic = List("topic_" + r.nextInt(250))

val consumer: DStream[ConsumerRecord[String, Array[Byte]]] = KafkaUtils.createDirectStream(
  ssc,
  LocationStrategies.PreferBrokers,
  ConsumerStrategies.Subscribe[String, Array[Byte]](kafka_list_one_topic, kafka_parameters)
)

consumer.foreachRDD { eachRDD =>
  // DOING SOMETHING WITH THE DATA...
}

ssc.start()
ssc.awaitTermination()

But the issue with this approach is that Spark only runs the initial code (everything before the foreachRDD call) once, when it creates the Kafka consumer DStream; in the following micro-batches it only executes the foreachRDD body.

As an example, let's say that r.nextInt(250) returned 40. The Spark Streaming job will connect to topic_40 and process its data, but in the next micro-batches it will still connect to topic_40 and ignore all the commands before the foreachRDD statement.

I guess this is expected, since the code before the foreachRDD statement runs only once, on the Spark driver, when the streaming graph is defined.
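For illustration, here is a minimal sketch of that behavior (the batchCounter variable and the println messages are hypothetical, added only for demonstration): the line outside foreachRDD runs a single time, while the foreachRDD body is re-executed on the driver for every 5-second micro-batch.

// Hypothetical sketch: shows which parts run once vs. once per micro-batch.
println("stream setup: runs once, so the randomly chosen topic is fixed at this point")

var batchCounter = 0
consumer.foreachRDD { eachRDD =>
  batchCounter += 1   // this driver-side code runs again for every 5-second micro-batch
  println(s"micro-batch #$batchCounter, record count: ${eachRDD.count()}")
}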

My question is: is there a way to do this without having to relaunch the Spark application every 5 seconds?

Thank you.

My approach would be really simple: if you want it to be really random and don't care about any other consequences, make kafka_list_one_topic a mutable variable and reassign it in the streaming code.

val r = scala.util.Random
var kafka_list_one_topic = List("topic_" + r.nextInt(250))

val consumer: DStream[ConsumerRecord[String, Array[Byte]]] = KafkaUtils.createDirectStream(
  ssc,
  LocationStrategies.PreferBrokers,
  ConsumerStrategies.Subscribe[String, Array[Byte]](kafka_list_one_topic, kafka_parameters)
)

consumer.foreachRDD { eachRDD =>
  // DOING SOMETHING WITH THE DATA...
  kafka_list_one_topic = List("topic_" + r.nextInt(250))   // reassign the topic list for the next batch
}

ssc.start()
ssc.awaitTermination()
