Spark 3 structured streaming use maxOffsetsPerTrigger in Kafka source with Trigger.Once

We need to use maxOffsetsPerTrigger in the Kafka source together with Trigger.Once() in Structured Streaming, but based on this issue it seems that Spark 3 reads allAvailable instead. Is there a way to achieve rate limiting in this situation?

Here is sample code in Spark 3:

import org.apache.spark.sql.streaming.Trigger

// Kafka source options; maxOffsetsPerTrigger is appended when configured
def options: Map[String, String] = Map(
  "kafka.bootstrap.servers" -> conf.getStringSeq("bootstrapServers").mkString(","),
  "subscribe" -> conf.getString("topic")
) ++
  Option(conf.getLong("maxOffsetsPerTrigger")).map("maxOffsetsPerTrigger" -> _.toString)

val streamingQuery = sparkSession.readStream.format("kafka").options(options)
  .load()
  .writeStream
  .format("console") // sink omitted in the original snippet; console used here as a placeholder
  .trigger(Trigger.Once())
  .start()

There is no other way around it to properly set a rate limit. If maxOffsetsPerTrigger is not applicable for streaming jobs with the Once trigger, you could do the following to achieve an identical result:

  1. Choose another trigger, use maxOffsetsPerTrigger to limit the rate, and kill the job manually after it has finished processing all data.

  2. Use the options startingOffsets and endingOffsets and make the job a batch job (see the sketch after this list). Repeat until you have processed all data within the topic. However, there is a reason why "Streaming in RunOnce mode is better than Batch", as detailed here.
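A minimal sketch of option 2, assuming a local broker, a single-partition topic named testTopic, and illustrative offset ranges (none of these values come from the answer): the topic is read as a bounded batch by pinning startingOffsets and endingOffsets, and the offset window is advanced on each run until the whole topic has been consumed.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-batch-slice").getOrCreate()

// Batch read (spark.read, not readStream) of a fixed offset window.
// Offsets are given per topic and partition as JSON; "earliest"/"latest" also work.
val batchDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "testTopic")
  .option("startingOffsets", """{"testTopic":{"0":0}}""")
  .option("endingOffsets", """{"testTopic":{"0":10000}}""")
  .load()

// Process the slice and write it out; repeat with the next offset window
// (e.g. 10000 to 20000) until the end of the topic is reached.
batchDf
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .write
  .format("parquet")
  .save("path/to/output")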

The last option would be to look into the linked pull request and compile Spark on your own.

Here is how we "solved" this. It is basically the approach mike wrote about in the accepted answer.

In our case, the size of the messages varied very little, so we knew how much time processing a batch takes. In a nutshell, we:

  • changed Trigger.Once() to Trigger.ProcessingTime(<ms>), since maxOffsetsPerTrigger works with this mode
  • killed the running query by calling awaitTermination(<ms>) to mimic Trigger.Once()
  • set the processing interval to be larger than the termination interval so that exactly one "batch" would fit in and be processed

val kafkaOptions = Map[String, String](
  "kafka.bootstrap.servers" -> "localhost:9092",
  "failOnDataLoss" -> "false",
  "subscribePattern" -> "testTopic",
  "startingOffsets" -> "earliest",
  "maxOffsetsPerTrigger" -> "10" // "batch" size
)

val streamWriterOptions = Map[String, String](
  "checkpointLocation" -> "path/to/checkpoints"
)

val processingInterval = 30000L
val terminationInterval = 15000L

sparkSession
  .readStream
  .format("kafka")
  .options(kafkaOptions)
  .load()
  .writeStream
  .options(streamWriterOptions)
  .format("console")
  .trigger(Trigger.ProcessingTime(processingInterval))
  .start()
  .awaitTermination(terminationInterval)

This works because the first batch is read and processed respecting the maxOffsetsPerTrigger limit; say it takes 10 seconds. The second batch then starts processing, but it is terminated in the middle of the operation after ~5 s and never reaches the set 30 s mark. However, the offsets are stored correctly, so Spark picks up and processes this "killed" batch in the next run.

A downside of this approach is that you have to know approximately how much time it takes to process one "batch": if you set terminationInterval too low, the job will constantly produce no output.

Of course, if you don't care about the exact number of batches you process in one run, you can easily adjust processingInterval to be several times smaller than terminationInterval (see the sketch below). In that case, you may process a varying number of batches in one go, while still respecting the value of maxOffsetsPerTrigger.
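For illustration, a hedged variant of the two interval values in the snippet above (the numbers are assumptions, not taken from the answer):

// A new micro-batch fires every 5 s, each capped at maxOffsetsPerTrigger offsets,
// and the query is stopped after ~30 s, so up to roughly
// terminationInterval / processingInterval = 6 batches may complete per run.
val processingInterval = 5000L
val terminationInterval = 30000L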
