
Spark Structured Streaming - Kafka offset handling

When I start my Spark Structured Streaming 3.0.1 application from the latest offset, it works well. But when I want to start from some earlier offset, for example:

  • startingOffsets set to "earliest"
  • startingOffsets set to a particular offset, such as {"MyTopic-v1":{"0":1686734237}}

I can see in the logs that the starting offset is picked up correctly, but then a series of seeks happens (starting from my defined position) until it reaches the current latest offset.

I dropped my checkpoint directory and tried several options, but the scenario is always the same: the correct starting offset is reported, but then the job takes a very long time slowly seeking to the most recent offset before it starts processing. Any idea why, and what should I additionally check?

2021-02-19 14:52:23 INFO  KafkaConsumer:1564 - [...] Seeking to offset 1786734237 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO  KafkaConsumer:1564 - [...] Seeking to offset 1786734737 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO  KafkaConsumer:1564 - [...] Seeking to offset 1786735237 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO  KafkaConsumer:1564 - [...] Seeking to offset 1786735737 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO  KafkaConsumer:1564 - [...] Seeking to offset 1786736237 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO  KafkaConsumer:1564 - [...] Seeking to offset 1786736737 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO  KafkaConsumer:1564 - [...] Seeking to offset 1786737237 for partition MyTopic-v1-0

I left the application running for a longer time and it eventually started producing files, but my processing trigger of 100 seconds was not met; the data showed up much later, after 20-30 minutes.

(I also tested it on Spark 2.4.5 with the same problem. Maybe it's some Kafka configuration?)

Using the option startingOffsets with a JSON object as you showed should work perfectly fine.

What you have observed is that on the first start of the application, the Structured Streaming job will read all offsets from the provided one (1686734237) up to the last available offset in the topic. As this can be quite a large number of messages, processing that big chunk will keep the first micro-batch busy for a long time.
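For reference, the reader configuration from the question can be sketched in Scala like this (the broker address is an assumption; topic, partition, and offset are taken from the question). Everything between the configured offset and the current log end lands in that first micro-batch:

```scala
// Minimal sketch of the Kafka source described in the question:
// start partition 0 of MyTopic-v1 at a fixed offset.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("offset-demo")
  .getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // assumption: your broker list
  .option("subscribe", "MyTopic-v1")
  // -1 = latest, -2 = earliest; any other value is an absolute offset per partition
  .option("startingOffsets", """{"MyTopic-v1":{"0":1686734237}}""")
  .load()
```

With no further limits, batch 0 spans from offset 1686734237 to whatever the latest offset is at start time, which explains the long first run.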

Remember that the Trigger option only defines the triggering frequency of the micro-batches. You should make sure to align this trigger rate with the expected processing time. I see basically two options here:

  • use the option maxOffsetsPerTrigger to limit the number of offsets fetched from Kafka per trigger / micro-batch
  • avoid setting any Trigger at all, which by default lets the stream fire the next micro-batch as soon as the previous one has finished processing its data
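The first option can be sketched as follows (a Scala sketch with assumed broker, sink path, and limit values; the file sink matches the question's mention of produced files). Capping maxOffsetsPerTrigger makes the backlog drain in bounded chunks instead of one huge initial batch:

```scala
// Sketch: throttle catch-up reads so each micro-batch stays small and predictable.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("offset-demo-throttled")
  .getOrCreate()

val throttled = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // assumption
  .option("subscribe", "MyTopic-v1")
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", "500000")          // assumption: tune to your throughput
  .load()

throttled.writeStream
  .format("parquet")                                 // assumption: file sink as in the question
  .option("path", "/data/out")                       // assumed paths
  .option("checkpointLocation", "/data/checkpoint")
  .start()                                           // no trigger(): next batch starts as soon as the previous finishes
```

Once the stream has caught up to the head of the topic, each micro-batch naturally contains only the newly arrived offsets, so the limit mostly matters during the initial catch-up phase.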
