
Challenges while processing Kafka Messages with Spark Streaming

I want to process the messages reported at a web server in real time. The messages reported at the web server belong to different sessions, and I want to do some session-level aggregations. For this purpose I plan to use Spark Streaming fronted by Kafka. Even before I start, I have listed a few challenges which this architecture is going to pose. Can someone familiar with this ecosystem help me out with these questions:

  1. If each Kafka message belongs to a particular session, how to manage session affinity so that the same Spark executor sees all the messages linked to a session?
  2. How to ensure that messages belonging to a session are processed by a Spark executor in the order they were reported at Kafka? Can we somehow achieve this without putting a constraint on thread count and incurring processing overheads (like sorting by message timestamp)?
  3. When to checkpoint session state? How is state resurrected from the last checkpoint in case of an executor node crash? How is state resurrected from the last checkpoint in case of a driver node crash?
  4. How is state resurrected if a node (executor/driver) crashes before checkpointing its state? If Spark recreates the state RDD by replaying messages, where does it start replaying the Kafka messages from: the last checkpoint onwards, or does it process all the messages needed to recreate the partition? Can/does Spark Streaming resurrect state across multiple streaming batches or only for the current batch, i.e. can the state be recovered if checkpointing was not done during the last batch?

If each Kafka message belongs to a particular session, how to manage session affinity so that the same Spark executor sees all the messages linked to a session?

Kafka divides topics into partitions, and every partition can only be read by one consumer at a time, so you need to make sure that all messages belonging to one session go into the same partition. Partition assignment is controlled via the key that you assign to every message, so the easiest way to achieve this would probably be to use the session id as the key when sending data. That way the same consumer will get all messages for one session. There is one caveat though: Kafka will rebalance the assignment of partitions to consumers when a consumer joins or leaves the consumer group. If this happens mid-session, it can (and will) happen that half the messages for that session go to one consumer and the other half go to a different consumer after the rebalance. To avoid this, you'll need to manually subscribe to specific partitions in your code so that every processor has its specific set of partitions and does not change those. Have a look at ConsumerStrategies.Assign in the Spark Kafka integration code for this.
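Below is a minimal Scala sketch of both sides of this, not taken from the original answer: the producer uses the session id as the record key so that Kafka's default partitioner keeps a session on one partition, and the Spark side pins fixed partitions with ConsumerStrategies.Assign so a consumer-group rebalance cannot move them. The topic name, broker address, group id, and partition numbers are illustrative assumptions.

```scala
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object SessionAffinitySketch {

  // Producer side: using the session id as the record key means Kafka's
  // default partitioner hashes it, so every event of a session lands in
  // the same partition (and is therefore read by the same consumer).
  def sendEvent(producer: KafkaProducer[String, String],
                sessionId: String,
                payload: String): Unit = {
    producer.send(new ProducerRecord[String, String]("web-events", sessionId, payload))
  }

  // Consumer side: assign fixed partitions to this stream instead of
  // subscribing to the topic, so a group rebalance can never move a
  // session's partition to a different executor mid-session.
  def createStream(ssc: StreamingContext) = {
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "session-aggregator"
    )
    // This stream only ever reads partitions 0 and 1 of "web-events".
    val partitions = Seq(
      new TopicPartition("web-events", 0),
      new TopicPartition("web-events", 1)
    )
    KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Assign[String, String](partitions, kafkaParams)
    )
  }
}
```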


How to ensure that messages belonging to a session are processed by a Spark executor in the order they were reported at Kafka? Can we somehow achieve this without putting a constraint on thread count and incurring processing overheads (like sorting by message timestamp)?

Kafka preserves ordering per partition, so there is not much you need to do here. The only thing is to avoid having multiple in-flight requests from the producer to the broker at the same time, which you can configure via the producer parameter max.in.flight.requests.per.connection. As long as you keep this at 1, you should be safe, if I understand your setup correctly.
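A small sketch of the producer configuration this refers to (the broker address and other values are illustrative); with at most one unacknowledged request per connection, a retried request cannot overtake a later one and break per-partition ordering:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("key.serializer", classOf[StringSerializer].getName)
props.put("value.serializer", classOf[StringSerializer].getName)
props.put("acks", "all")
props.put("retries", "3")
// At most one in-flight request per connection, so retries cannot
// reorder messages within a partition.
props.put("max.in.flight.requests.per.connection", "1")

val producer = new KafkaProducer[String, String](props)
```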


When to checkpoint session state? How is state resurrected from the last checkpoint in case of an executor node crash? How is state resurrected from the last checkpoint in case of a driver node crash?

I'd suggest reading the offset storage section of the Spark Streaming + Kafka Integration Guide, which should answer a lot of these questions already.

The short version is, you can persist your last read offset into Kafka, and you should definitely do this whenever you checkpoint your executors. That way, whenever a new executor picks up processing, no matter whether it was restored from a checkpoint or not, it will know where to read from in Kafka.
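A minimal sketch of that pattern, assuming `stream` is the direct stream created with KafkaUtils.createDirectStream as in the earlier sketch: read the offset ranges of each batch, process the batch (and checkpoint any session state), then commit those offsets back to Kafka.

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // The Kafka offset ranges covered by this micro-batch, per topic-partition.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the batch / update and checkpoint session state here ...

  // Commit only after the batch has been handled successfully, so a crash
  // before this point means the batch is re-read rather than silently lost.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```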


How is state resurrected if a node (executor/driver) crashes before checkpointing its state? If Spark recreates the state RDD by replaying messages, where does it start replaying the Kafka messages from: the last checkpoint onwards, or does it process all the messages needed to recreate the partition? Can/does Spark Streaming resurrect state across multiple streaming batches or only for the current batch, i.e. can the state be recovered if checkpointing was not done during the last batch?

My Spark knowledge here is a bit shaky, but I would say that this is not something that is done by Kafka/Spark for you, but rather something that you actively need to influence with your code. By default, if a new Kafka stream is started up and finds no previously committed offset, it will simply start reading from the end of the topic, so it would only get messages that are produced after the consumer is started. If you need to resurrect state, then you either need to know from what exact offset you want to start re-reading messages, or just start reading from the beginning again. You can pass offsets to read from into the above-mentioned .Assign() method when distributing partitions.
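For example, here is a sketch of resuming from offsets you persisted yourself, using the Assign overload that also takes starting offsets; the topic, partition numbers, and offset values are made up, and `ssc` and `kafkaParams` are assumed to be defined as in the earlier sketch.

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val partitions = Seq(
  new TopicPartition("web-events", 0),
  new TopicPartition("web-events", 1)
)

// Offsets recovered from your own state store; reading resumes exactly here.
val fromOffsets = Map(
  new TopicPartition("web-events", 0) -> 4200L,
  new TopicPartition("web-events", 1) -> 3100L
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Assign[String, String](partitions, kafkaParams, fromOffsets)
)
```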

I hope this helps a bit. I am sure it is by no means a complete answer to all your questions, but it is a fairly wide field to work in; let me know if I can be of further help.
