
Spark streaming. Reading in parallel from Kafka is causing repeated data

I have the following code that creates 6 input DStreams that read from a 6-partition Kafka topic using the direct approach. I found that even when I specify the same group ID for all the streams, I get the data repeated 6 times. If I create only 3 DStreams, I get the data repeated 3 times, and so on.

numStreams = 6
kafkaStreams = [KafkaUtils.createDirectStream(ssc, ["send6partitions"], {
  "metadata.broker.list": brokers,
  "fetch.message.max.bytes": "20971520",
  "spark.streaming.blockInterval" : "2000ms",
  "group.id" : "the-same"},
  valueDecoder = decodeValue, keyDecoder = decode_key) for _ in range (numStreams)]

kvs = ssc.union(*kafkaStreams)

What am I doing wrong here?

I'm not familiar with Python, but the Direct Stream in Spark Scala does not commit any offsets. So if you open a stream n times without committing the offset of any read message, your consumer will start at the beginning.

If it is the same in Python, you do not need to start n streams. Start one stream, and Spark will handle the distribution of partitions to executors/tasks itself.

In the direct approach you shouldn't create multiple DStreams from one topic.

From the documentation:

Simplified Parallelism: No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.

So just create one DStream; Spark will use all the Kafka partitions :)
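A minimal sketch of the single-stream version, assuming the same Spark 1.x/2.x Python API as the question (pyspark.streaming.kafka, which as far as I know was removed in Spark 3.0). The helper names kafka_params, create_single_stream and log_partition_count are made up for illustration; the decoders are the ones from the question, passed in as arguments. "spark.streaming.blockInterval" is left out of the Kafka parameters because it is a Spark setting for receiver-based streams, not a Kafka consumer property.

```python
def kafka_params(brokers):
    """Kafka consumer settings only; Spark settings belong in SparkConf."""
    return {
        "metadata.broker.list": brokers,
        "fetch.message.max.bytes": "20971520",
        # Assumption: with the direct approach, group.id does not split
        # partitions among several streams, so it cannot prevent duplicates.
        "group.id": "the-same",
    }


def create_single_stream(ssc, brokers, key_decoder, value_decoder,
                         topic="send6partitions"):
    """Create ONE direct stream; it reads every partition of the topic."""
    # Imported here so the module loads even where this old API is missing.
    from pyspark.streaming.kafka import KafkaUtils
    return KafkaUtils.createDirectStream(
        ssc, [topic], kafka_params(brokers),
        keyDecoder=key_decoder, valueDecoder=value_decoder)


def log_partition_count(stream):
    """Print the partition count of each batch RDD (expected: one per Kafka partition)."""
    stream.foreachRDD(
        lambda rdd: print("RDD partitions:", rdd.getNumPartitions()))
```

In the question's code this would replace both the list comprehension and the ssc.union call, for example kvs = create_single_stream(ssc, brokers, decode_key, decodeValue).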

Basically, Kafka topics are partitioned so that the load can be shared among multiple consumers. By default, each DStream you create reads from every partition of the Kafka topic in parallel, mapping the Kafka partitions onto the DStream's partitions. Creating 6 DStreams for one topic therefore means 6 independent consumers of the same topic; it does not mean one DStream per partition. Each of them receives every message once, so you get each message 6 times.
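The duplication can be reproduced with a toy model in plain Python, no Spark or Kafka involved. Everything here (topic, read_direct_stream, union, copies_per_message) is hypothetical and only mimics the behaviour described above: every stream keeps its own offsets and reads all partitions, and the union concatenates without deduplicating.

```python
from collections import Counter

# A toy topic with 6 partitions, each holding 3 messages.
topic = {p: [f"p{p}-m{i}" for i in range(3)] for p in range(6)}


def read_direct_stream(topic):
    """One stream: reads every partition, tracking offsets privately."""
    offsets = {p: 0 for p in topic}          # not shared with other streams
    batch = []
    for p, messages in topic.items():
        batch.extend(messages[offsets[p]:])  # read from own offset to the end
        offsets[p] = len(messages)
    return batch


def union(streams):
    """Like ssc.union: concatenates the streams, does not deduplicate."""
    return [msg for stream in streams for msg in stream]


def copies_per_message(num_streams):
    """Return the set of distinct copy counts seen across all messages."""
    merged = union([read_direct_stream(topic) for _ in range(num_streams)])
    return set(Counter(merged).values())


print(copies_per_message(6))  # {6}
print(copies_per_message(1))  # {1}
```

With 6 streams every message shows up 6 times, with 3 streams 3 times, and with a single stream exactly once, which matches what the question reports.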
