
Spark: Is the receiver in Spark Streaming a bottleneck?

I would like to understand how the receiver works in Spark Streaming. As per my understanding, there will be receiver tasks running on executors that collect data and save it as RDDs. Receivers start reading when start() is called. I need clarification on the following.

  1. How many receivers does a Spark Streaming job start? Multiple or one?
  2. Is the receiver implemented as push-based or pull-based?
  3. Can the receiver become a bottleneck?
  4. To achieve parallelism, the data should be partitioned across the worker nodes. So how is the streaming data distributed across the nodes?
  5. If new RDDs are formed on a new node at each batch interval, how does the SparkContext serialize the transform functions to that node after the job is submitted?
  6. Can the number of receivers launched be controlled by a parameter?

I would like to understand the anatomy of Spark Streaming and its receivers.

I'm going to answer based on my experience with Kafka receivers, which seems more or less similar to what goes on in Kinesis.

How many receivers does a Spark Streaming job start? Multiple or one?

Each receiver you open is a single connection. In Kafka, if you want to read concurrently from multiple partitions, you need to open up multiple receivers, and usually union them together.

Is the receiver implemented as push-based or pull-based?

Pull. In Spark Streaming, data is pulled from Kafka once per batch interval (specified when creating the StreamingContext).
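As a minimal sketch of where that interval is set (the application name and the 10-second interval are just example values):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The batch interval is the second argument to StreamingContext: every 10 seconds
// (example value) the data buffered by the receivers is turned into a new RDD.
val conf = new SparkConf().setAppName("receiver-example")
val streamingContext = new StreamingContext(conf, Seconds(10))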

Can the receiver become a bottleneck?

Broad question. It depends. If your batch intervals are long and you have a single receiver, your backlog may start to fill up. It's mainly trial and error until you reach the optimum balance in your streaming job.
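If the receiver does turn out to be the bottleneck, Spark Streaming also exposes rate-limiting settings worth experimenting with; a sketch, where the numbers are placeholders to tune rather than recommendations:

import org.apache.spark.SparkConf

// Cap how fast each receiver ingests records and let Spark adapt the rate to the
// observed batch delays. Both keys are standard Spark Streaming settings; the
// values are examples to tune by trial and error.
val conf = new SparkConf()
  .setAppName("rate-limited-stream")
  .set("spark.streaming.receiver.maxRate", "10000")    // max records per second, per receiver
  .set("spark.streaming.backpressure.enabled", "true") // throttle based on scheduling delay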

To achieve parallelism, the data should be partitioned across the worker nodes. So how is the streaming data distributed across the nodes?

You can create concurrency, as I previously stated, by opening multiple receivers to the underlying data source. Further, after the data has been read, it can be repartitioned using the standard Spark mechanisms for partitioning data.
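A sketch of that second step, assuming a unioned Kafka DStream of key/value pairs like the one shown at the end of this answer (the partition count of 16 is an arbitrary example and the helper is hypothetical):

import org.apache.spark.streaming.dstream.DStream

// Redistribute each batch's RDD over more partitions than there are receivers,
// so the heavy transformations run with more parallelism across the workers.
def withMoreParallelism(stream: DStream[(String, String)]): DStream[(String, String)] =
  stream.repartition(16)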

If new RDDs are formed on a new node at each batch interval, how does the SparkContext serialize the transform functions to that node after the job is submitted?

The same way it serializes each Task in the stages: by using the serializer of choice and sending the data over the wire. I'm not sure I understand what you mean here.
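For completeness, the data serializer itself is configurable; a sketch of switching to Kryo (a common choice, not something the question requires, and task closures are handled by Spark's internal closure serializer regardless):

import org.apache.spark.SparkConf

// spark.serializer controls how data is serialized when shipped over the wire
// (e.g. shuffled or cached records); Kryo is a common alternative to Java serialization.
val conf = new SparkConf()
  .setAppName("serialization-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")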

Can the number of receivers launched be controlled by a parameter?

Yes, you can have a configuration parameter which determines the number of receivers you open. Such code can look like this:

// numStreams could come from your configuration; 5 is just an example value
val numStreams = 5

// One receiver-based Kafka stream per connection. The Zookeeper quorum, consumer
// group and topic map are placeholder values for the old KafkaUtils.createStream API.
val kafkaStreams = (1 to numStreams).map { i =>
  KafkaUtils.createStream(streamingContext, "zk-host:2181", "consumer-group", Map("topic" -> 1))
}

// Union the streams so downstream logic sees a single DStream
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()
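One last note (not in the original snippet): a streaming job only begins consuming once the context is started, so after wiring up the union you would typically end with:

// Start the receivers and block until the job is stopped or fails
streamingContext.start()
streamingContext.awaitTermination()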

