
How do I implement offset management in Spark streaming with Kafka inside Spark itself?

I have to implement offset management within Spark for a streaming job in Java which reads from a Kafka stream. However, although the process is described in the official documentation here, it does not actually give a code example of how to store and retrieve offsets from checkpoints. Rather, it cryptically says that

If you enable Spark checkpointing, offsets will be stored in the checkpoint.

Does this mean that if I just provide the checkpoint directory to the Spark context, it will automatically store offsets? And what about retrieving the last offset read when the application comes back up? The detail page on checkpointing that is linked there also leaves everything to the reader and only gives the syntax for setting the checkpoint destination.
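For reference, this is the pattern I understand the documentation to be hinting at: a minimal sketch (the checkpoint directory, app name, and batch interval below are placeholders) in which `JavaStreamingContext.getOrCreate` restores the context, and with it the consumed Kafka offsets, from the checkpoint on restart, or builds a fresh one otherwise. Is this all that is required?

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointedStream {
    // hypothetical checkpoint location; should be a fault-tolerant store like HDFS in production
    private static final String CHECKPOINT_DIR = "/tmp/spark-checkpoint";

    public static void main(String[] args) throws InterruptedException {
        // getOrCreate either rebuilds the context (including Kafka offsets) from the
        // checkpoint directory, or calls the factory function to create a fresh one.
        JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(CHECKPOINT_DIR, () -> {
            SparkConf conf = new SparkConf().setAppName("offset-demo");
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));
            ssc.checkpoint(CHECKPOINT_DIR); // enables checkpointing, offsets included
            // ... create the Kafka direct stream and wire up all processing here,
            // inside the factory, so it is restored correctly from the checkpoint ...
            return ssc;
        });
        jssc.start();
        jssc.awaitTermination();
    }
}
```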

This and this give some clue as to how to use checkpoints, but in all of the instances I can find, they have been used to cumulatively calculate something rather than to store offsets. This question comes close, but still does not describe it.

Please help me realize this goal.

Saving offsets in the checkpoint will not work for you, because Spark also saves the job's tasks in the checkpoint, so upgrading the code requires deleting the checkpoint (and with it the stored offsets). Instead, you can save the offsets in ZooKeeper, Kafka, the file system, or any database.
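For example, if you keep Kafka itself as the offset store, the spark-streaming-kafka-0-10 integration lets you commit offsets back to Kafka after each batch. Here is a sketch following the pattern from the Spark + Kafka integration guide; the broker address, group id, and topic are placeholders, and `jssc` is assumed to be your existing `JavaStreamingContext`:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.CanCommitOffsets;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;

Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "localhost:9092"); // placeholder
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "my-group");                // placeholder
kafkaParams.put("enable.auto.commit", false);           // we commit manually below

Collection<String> topics = Arrays.asList("my-topic");  // placeholder

JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

stream.foreachRDD(rdd -> {
    // Capture this batch's offset ranges before doing anything else.
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

    // ... process the batch here ...

    // Commit the offsets back to Kafka only after the batch succeeds, so that
    // on restart the consumer group resumes from the last committed position.
    ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
});
```

If you need exactly-once semantics, store the offsets in the same database transaction as your results instead of committing them to Kafka, since `commitAsync` gives you at-least-once delivery.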
