

Spark structured streaming acknowledge messages

I am using Spark Structured Streaming to read from a Kafka topic (say topic1) and a Kafka sink to write to another topic (topic1-result). I can see that the messages are not removed from topic1 after they have been written to the other topic via the sink.

// Subscribe to 1 topic
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("subscribe", "topic1")
  .load()

// Sink to another topic
val ds = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("checkpointLocation", "/tmp/checkpoint1")
  .option("topic", "topic1-result")
  .start()

The documentation says we cannot use auto-commit for structured streams:

enable.auto.commit: Kafka source doesn't commit any offset.

But how do I acknowledge messages and remove the processed messages from the topic (topic1)?

Two considerations:

  1. Messages are not removed from Kafka once you have committed. When your consumer commits, Kafka advances the offset stored for that consumer group on the topic, but the messages themselves stay in the topic until the retention time you configured for the topic expires (a sketch for adjusting the retention follows after this list).

  2. Indeed, the Kafka source does not commit anything; instead, the stream stores the offset that points to the next message in the streaming query's checkpoint directory. So when your stream restarts, it picks up that last offset and resumes consuming from there (a sketch for observing these offsets follows below).
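
For the first point: whether records disappear from topic1 is controlled by the topic's retention settings on the broker, not by consumer commits. Below is a minimal sketch (not part of the original question) that lowers retention.ms for topic1 with the Kafka AdminClient, assuming the same host1:port1 broker; the one-hour value is only an illustration.

import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "host1:port1")
val admin = AdminClient.create(props)

// Record deletion is driven by retention, not by consumer commits.
// As an example, keep records in topic1 for one hour (3600000 ms).
val resource = new ConfigResource(ConfigResource.Type.TOPIC, "topic1")
val op = new AlterConfigOp(new ConfigEntry("retention.ms", "3600000"), AlterConfigOp.OpType.SET)

val configs = new java.util.HashMap[ConfigResource, java.util.Collection[AlterConfigOp]]()
configs.put(resource, java.util.Arrays.asList(op))
admin.incrementalAlterConfigs(configs).all().get()
admin.close()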

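For the second point: the offsets the stream has processed are recorded by Spark under the checkpoint directory (the offsets and commits subdirectories of /tmp/checkpoint1) and are also exposed through the query progress API. As a rough illustration, a StreamingQueryListener can print the end offsets of each micro-batch; the class name and the println below are illustrative, not part of the original code.

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Prints the Kafka offsets each micro-batch finished at; Spark persists the
// same information in the checkpoint directory used by the query.
class OffsetLoggingListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    event.progress.sources.foreach { source =>
      // endOffset is a JSON string such as {"topic1":{"0":42}}
      println(s"source=${source.description} endOffset=${source.endOffset}")
    }
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}

// Register before starting the query:
spark.streams.addListener(new OffsetLoggingListener)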