
Spark Streaming data dissemination in Kafka and TextSocket Stream

I want to understand how data is read from text socket streams or Kafka input in Spark Streaming.

  1. Is data read from the driver in a single thread and then disseminated to the workers? Wouldn't a single point of data reading become a bottleneck?

  2. Do all workers read the data in parallel? If so, how is the read synchronized?

1) No, data is read directly by the executors. They open their own connections to the appropriate brokers based on which partitions they cover. See the next point.

2) Each executor (assuming more than one) has a subset of the partitions for a given topic. If there are 2 partitions and you have 2 executors, each executor will get 1 partition. If you only had 1 partition, then one executor would get all the data and the other would get nothing. In Kafka, you are only guaranteed that messages will be delivered in order within a partition, and absent magic Spark can do no better.
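To see that mapping concretely, here is a minimal sketch using the direct Kafka API (Spark 1.3+, Kafka 0.8 integration). The broker list and topic name are placeholder values; the point is that each batch RDD carries exactly one Spark partition per Kafka partition, so the partition count printed below equals the topic's partition count.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object PartitionMappingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-mapping-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Hypothetical broker list and topic; adjust for your cluster.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics      = Set("events")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Each batch RDD has one Spark partition per Kafka partition,
    // so 2 Kafka partitions across 2 executors means 1 partition each.
    stream.foreachRDD { rdd =>
      println(s"Batch has ${rdd.partitions.length} partitions")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```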

Is data read from the driver in a single thread and then disseminated to the workers? Wouldn't a single point of data reading become a bottleneck?

No, generally that's not how it's done. With Kafka, you can choose between two approaches:

  1. Receiver-based stream - Spark workers run receivers, which are essentially open connections to Kafka. They read the data, persist it to a write-ahead log (WAL), and update offsets in ZooKeeper. This approach requires you to spin up multiple receivers for concurrent reads from Kafka. This is typically done by creating multiple DStreams and then using DStream.union to unify all the data sources (see the first sketch after this list).

  2. Receiverless (direct) stream - This is the new API introduced in Spark 1.3.0. Here the driver reads the offsets of the different Kafka partitions and launches jobs that hand each worker a specific offset range. This approach doesn't require you to open concurrent connections to your Kafka cluster yourself; it opens one connection per Kafka partition for you, which makes it simple for each worker to query Kafka for exactly the required range. However, this approach does not store offsets in ZooKeeper. Instead, offsets are reliably checkpointed using Spark's checkpointing mechanism for fault tolerance (see the second sketch after this list).
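For the receiver-based approach in point 1, here is a minimal sketch of fanning out several receivers and unioning the resulting streams. The ZooKeeper quorum, consumer group, topic name, and receiver count are placeholder values.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverUnionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("receiver-union-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val zkQuorum = "zk1:2181"          // hypothetical ZooKeeper quorum
    val group    = "my-consumer-group" // hypothetical consumer group id
    val topicMap = Map("events" -> 1)  // topic -> threads per receiver

    // Each createStream call starts one receiver (one Kafka connection);
    // spin up several and union them to consume in parallel.
    val numReceivers = 3
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
    }
    val unified = ssc.union(streams)

    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```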
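And for the receiverless (direct) approach in point 2, a sketch showing how the offset ranges computed by the driver are visible on each batch, and how checkpointing (rather than ZooKeeper) is used to recover them after a driver failure. The checkpoint directory, broker list, and topic name are placeholders.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val checkpointDir = "hdfs:///checkpoints/direct-sketch" // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("direct-offsets-sketch")
      val ssc  = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir) // offsets are recovered from here, not ZooKeeper

      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("events"))

      stream.foreachRDD { rdd =>
        // The driver decided these ranges; each worker fetches exactly its range.
        val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        ranges.foreach { r =>
          println(s"topic=${r.topic} partition=${r.partition} " +
            s"from=${r.fromOffset} until=${r.untilOffset}")
        }
      }
      ssc
    }

    // Rebuild the context from the checkpoint if one exists.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```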

Do all workers read the data in parallel? If so, how is the read synchronized?

That depends on which of the above options for reading you choose. If, for example, you choose the receiver-based approach and only start a single connection to Kafka, you'll have a single worker consuming all the data. In the receiverless approach, multiple connections are opened on your behalf and distributed across different workers.

I suggest reading a great blog post by Databricks, "Improvements to Kafka integration of Spark Streaming", and the Spark Streaming + Kafka integration documentation.
