
Spark Structured Streaming - Kafka offset handling

When I start my Spark Structured Streaming 3.0.1 application from the latest offset, it works well. But when I want to start from some earlier offset, for example:

  • startingOffsets set to "earliest"
  • startingOffsets set to a particular offset, such as {"MyTopic-v1":{"0":1686734237}}

I can see in the logs that the starting offset is picked up correctly, but then a series of seeks happens (starting from my defined position) until it reaches the current latest offset.

I dropped my checkpoint directory and tried several options, but the scenario is always the same: the correct starting offset is reported, but then the job takes a very long time slowly seeking to the most recent offset before it starts processing. Any idea why, and what should I additionally check?

2021-02-19 14:52:23 INFO  KafkaConsumer:1564 - [...] Seeking to offset 1786734237 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO  KafkaConsumer:1564 - [...] Seeking to offset 1786734737 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO  KafkaConsumer:1564 - [...] Seeking to offset 1786735237 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO  KafkaConsumer:1564 - [...] Seeking to offset 1786735737 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO  KafkaConsumer:1564 - [...] Seeking to offset 1786736237 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO  KafkaConsumer:1564 - [...] Seeking to offset 1786736737 for partition MyTopic-v1-0
2021-02-19 14:52:23 INFO  KafkaConsumer:1564 - [...] Seeking to offset 1786737237 for partition MyTopic-v1-0

I left the application running for a longer time and it eventually started producing files, but my processing trigger of 100 seconds was not met; the data showed up much later, after 20-30 minutes.

(I also tested it on Spark 2.4.5 with the same problem. Maybe it's some Kafka configuration?)

Using the option startingOffsets with a JSON object as you showed should work perfectly fine.

What you have observed is that on the first start of the application, the Structured Streaming job will read all offsets from the provided one (1686734237) up to the last available offset in the topic. As this can be quite a large number of messages, processing that big chunk will keep the first micro-batch busy for a long time.
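For reference, the reader configuration from the question can be sketched in Scala like this (the broker address is an assumption; topic, partition, and offset are taken from the question). Everything between the configured offset and the current log end lands in that first micro-batch:

```scala
// Minimal sketch of the Kafka source described in the question:
// start partition 0 of MyTopic-v1 at a fixed offset.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("offset-demo")
  .getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // assumption: your broker list
  .option("subscribe", "MyTopic-v1")
  // -1 = latest, -2 = earliest; any other value is an absolute offset per partition
  .option("startingOffsets", """{"MyTopic-v1":{"0":1686734237}}""")
  .load()
```

With no further limits, batch 0 spans from offset 1686734237 to whatever the latest offset is at start time, which explains the long first run.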

Remember that the Trigger option only defines the triggering frequency of the micro-batches. You should make sure to align this trigger rate with the expected processing time. I see basically two options here:

  • use the option maxOffsetsPerTrigger to limit the number of offsets fetched from Kafka per trigger / micro-batch
  • avoid setting any Trigger at all, which by default lets the stream fire the next micro-batch as soon as the previous one has finished processing its data
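The first option can be sketched as follows (a Scala sketch with assumed broker, sink path, and limit values; the file sink matches the question's mention of produced files). Capping maxOffsetsPerTrigger makes the backlog drain in bounded chunks instead of one huge initial batch:

```scala
// Sketch: throttle catch-up reads so each micro-batch stays small and predictable.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("offset-demo-throttled")
  .getOrCreate()

val throttled = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // assumption
  .option("subscribe", "MyTopic-v1")
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", "500000")          // assumption: tune to your throughput
  .load()

throttled.writeStream
  .format("parquet")                                 // assumption: file sink as in the question
  .option("path", "/data/out")                       // assumed paths
  .option("checkpointLocation", "/data/checkpoint")
  .start()                                           // no trigger(): next batch starts as soon as the previous finishes
```

Once the stream has caught up to the head of the topic, each micro-batch naturally contains only the newly arrived offsets, so the limit mostly matters during the initial catch-up phase.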
