
Spark Streaming data dissemination in Kafka and TextSocket Stream

I want to understand how data is read from text socket streams or Kafka input in Spark Streaming.

  1. Is data read by the driver in a single thread and then disseminated to the workers? Wouldn't a single point of data reading become a bottleneck?

  2. Do all workers read the data in parallel? If so, how is the read synchronized?

1) No, data is read directly by the executors. They open their own connections to the appropriate brokers, depending on which partitions they cover. See the next point.

2) Each executor (assuming there is more than one) gets a subset of the partitions for a given topic. If there are 2 partitions and you have 2 executors, each executor gets 1 partition. If you only had 1 partition, then 1 executor would get all the data and the other would get nothing. Kafka only guarantees that messages are delivered in order within a partition, and Spark cannot do better than that.
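
To make the two points above concrete, here is a minimal Scala sketch of the direct (receiverless) approach described in the answer below, using the `KafkaUtils.createDirectStream` API from the Spark 1.3-era Kafka integration. The broker list, topic name, and batch interval are placeholder values, not anything from the question.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-kafka-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Placeholder broker list and topic; each executor connects straight to the
    // brokers that lead the partitions it has been assigned.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics      = Set("my-topic")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // One Spark partition per Kafka partition: with 2 Kafka partitions and 2 executors,
    // each executor processes exactly one partition per batch.
    stream.foreachRDD { rdd =>
      println(s"Kafka partitions read this batch: ${rdd.partitions.length}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```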

Is data read by the driver in a single thread and then disseminated to the workers? Wouldn't a single point of data reading become a bottleneck?

No, generally that's not how it's done. With Kafka, you can choose between two approaches:

  1. Receiver-based stream - Spark workers run receivers, which are essentially connections to Kafka. They read the data, use a WAL (write-ahead log), and update ZooKeeper with the consumed offsets. This approach requires you to spin up multiple receivers to read from Kafka concurrently. This is typically done by creating multiple DStreams and then using DStream.union to unify all the data sources (see the sketch after this list).

  2. Receiverless (direct) stream - This is the new API introduced in Spark 1.3.0. With this approach the driver reads the offsets of the different Kafka partitions and launches jobs with specific offset ranges for each worker. It doesn't require you to open concurrent connections to your Kafka cluster yourself; one connection is opened per Kafka partition for you, which makes it simple for a worker to query Kafka for the required range. However, this approach does not store offsets in ZooKeeper; instead, offsets are reliably checkpointed using Spark's checkpointing mechanism for fault tolerance.
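
As referenced in the first bullet, here is a minimal Scala sketch of the receiver-based approach with several receivers unified via union. The ZooKeeper quorum, consumer group, topic name, checkpoint path, and receiver count are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverKafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("receiver-kafka-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/receiver-kafka-checkpoint") // required if the WAL is enabled

    // Placeholder connection details; each createStream call starts one receiver
    // (a long-running task on some executor) that consumes from Kafka.
    val zkQuorum = "zk1:2181"
    val groupId  = "my-consumer-group"
    val topicMap = Map("my-topic" -> 1) // topic -> consumer threads per receiver

    val numReceivers = 2
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
    }

    // Unify the receivers into a single DStream so downstream processing sees one source.
    val unified = ssc.union(streams)
    unified.map { case (_, message) => message }.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```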

Do all workers read the data in parallel? If so, how is the read synchronized?

That depends on which of the reading options above you choose. If, for example, you choose the receiver-based approach and only start a single connection to Kafka, you'll have a single worker consuming all the data. In the receiverless approach, multiple connections are opened on your behalf and distributed across different workers.

I suggest reading the great blog post by Databricks, Improvements to Kafka integration of Spark Streaming, and the Spark Streaming + Kafka integration documentation.

