
Spark Streaming data dissemination in Kafka and TextSocket Stream

I want to understand how data is read from text socket streams or Kafka input in Spark Streaming.

  1. Is data read from the driver in a single thread and then disseminated to the workers? Wouldn't a single point of data reading become a bottleneck?

  2. Do all workers read the data in parallel? If so, how is the read synchronized?

1) No, data is read directly by the executors. They open their own connections to the appropriate brokers based on which partitions they cover. See the next point.

2) Each executor (assuming more than one) has a subset of the partitions for a given topic. If there are 2 partitions and you have 2 executors, each executor will get 1 partition. If you only had 1 partition, then one executor would get all the data and the other would get nothing. In Kafka, you are only guaranteed that messages will be delivered in order within a partition, and absent magic Spark can do no better.
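To see that mapping concretely, here is a minimal sketch using the direct Kafka API (Spark 1.3+, Kafka 0.8 integration). The broker list and topic name are placeholder values; the point is that each batch RDD carries exactly one Spark partition per Kafka partition, so the partition count printed below equals the topic's partition count.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object PartitionMappingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-mapping-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Hypothetical broker list and topic; adjust for your cluster.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics      = Set("events")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Each batch RDD has one Spark partition per Kafka partition,
    // so 2 Kafka partitions across 2 executors means 1 partition each.
    stream.foreachRDD { rdd =>
      println(s"Batch has ${rdd.partitions.length} partitions")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```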

Is data read from the driver in a single thread and then disseminated to the workers? Wouldn't a single point of data reading become a bottleneck?

No, generally that's not how it's done. With Kafka, you can choose between two approaches:

  1. Receiver-based stream - Spark workers run receivers, which are essentially open connections to Kafka. They read the data, persist it to a write-ahead log (WAL), and update offsets in ZooKeeper. This approach requires you to spin up multiple receivers for concurrent reads from Kafka. This is typically done by creating multiple DStreams and then using DStream.union to unify all the data sources (see the first sketch after this list).

  2. Receiverless (direct) stream - This is the new API introduced in Spark 1.3.0. Here the driver reads the offsets of the different Kafka partitions and launches jobs that hand each worker a specific offset range. This approach doesn't require you to open concurrent connections to your Kafka cluster yourself; it opens one connection per Kafka partition for you, which makes it simple for each worker to query Kafka for exactly the required range. However, this approach does not store offsets in ZooKeeper. Instead, offsets are reliably checkpointed using Spark's checkpointing mechanism for fault tolerance (see the second sketch after this list).
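For the receiver-based approach in point 1, here is a minimal sketch of fanning out several receivers and unioning the resulting streams. The ZooKeeper quorum, consumer group, topic name, and receiver count are placeholder values.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverUnionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("receiver-union-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val zkQuorum = "zk1:2181"          // hypothetical ZooKeeper quorum
    val group    = "my-consumer-group" // hypothetical consumer group id
    val topicMap = Map("events" -> 1)  // topic -> threads per receiver

    // Each createStream call starts one receiver (one Kafka connection);
    // spin up several and union them to consume in parallel.
    val numReceivers = 3
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
    }
    val unified = ssc.union(streams)

    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```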
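And for the receiverless (direct) approach in point 2, a sketch showing how the offset ranges computed by the driver are visible on each batch, and how checkpointing (rather than ZooKeeper) is used to recover them after a driver failure. The checkpoint directory, broker list, and topic name are placeholders.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val checkpointDir = "hdfs:///checkpoints/direct-sketch" // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("direct-offsets-sketch")
      val ssc  = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir) // offsets are recovered from here, not ZooKeeper

      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("events"))

      stream.foreachRDD { rdd =>
        // The driver decided these ranges; each worker fetches exactly its range.
        val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        ranges.foreach { r =>
          println(s"topic=${r.topic} partition=${r.partition} " +
            s"from=${r.fromOffset} until=${r.untilOffset}")
        }
      }
      ssc
    }

    // Rebuild the context from the checkpoint if one exists.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```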

Do all workers read the data in parallel? If so, how is the read synchronized?

That depends on which of the above options for reading you choose. If, for example, you choose the receiver-based approach and only start a single connection to Kafka, you'll have a single worker consuming all the data. In the receiverless approach, multiple connections are opened on your behalf and distributed across different workers.

I suggest reading a great blog post by Databricks, "Improvements to Kafka integration of Spark Streaming", and the Spark Streaming + Kafka integration documentation.
