Java Kafka Structured Streaming
I have to perform batch queries (basically in a loop) from Kafka via Spark, each time starting from the last offset read at the previous iteration, so that I only read new data.
Dataset<Row> df = spark
.read()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test-reader")
.option("enable.auto.commit", true)
.option("kafka.group.id", "demo-reader") //not sure about the one to use
.option("group.id", "demo-reader")
.option("startingOffset", "latest")
.load()
It seems that latest is not supported in batch queries. I'm wondering if it is possible to do something similar in another way (without dealing directly with offsets).
EDIT: earliest seems to retrieve the whole data contained in the topic.
Can you try earliest instead of latest for startingOffsets, as shown in the example below:
Dataset<Row> df = spark
.read()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test-reader")
.option("enable.auto.commit", true)
.option("kafka.group.id", "demo-reader") //not sure about the one to use
.option("group.id", "demo-reader")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.load();
Please refer to the Spark docs.
As per the documentation, you should use "latest" for streaming and "earliest" for batch.
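If you do need each batch to resume where the previous one stopped, the Kafka batch source also accepts per-partition offsets as a JSON string for startingOffsets and endingOffsets (documented form: {"topic":{"0":23}}). Below is a minimal sketch, not from the original answer, of a loop that remembers the highest offset read in one batch and uses offset + 1 as the starting point of the next; the topic name test-reader and a single partition 0 are assumptions for illustration.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.List;
import static org.apache.spark.sql.functions.max;

public class KafkaBatchLoop {
    public static void main(String[] args) throws InterruptedException {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-batch-loop")
                .master("local[*]")
                .getOrCreate();

        // First iteration starts at the beginning of the topic; afterwards the
        // JSON form {"topic":{"partition":offset}} is used to resume.
        String startingOffsets = "earliest";

        while (true) {
            Dataset<Row> df = spark
                    .read()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "localhost:9092")
                    .option("subscribe", "test-reader")
                    .option("startingOffsets", startingOffsets)
                    .option("endingOffsets", "latest")
                    .load();

            // ... process df here ...

            // Highest offset read in this batch (single partition 0 assumed for
            // simplicity; with several partitions you would group by "partition").
            List<Row> maxRows = df.agg(max("offset").alias("maxOffset")).collectAsList();
            if (!maxRows.isEmpty() && !maxRows.get(0).isNullAt(0)) {
                long next = maxRows.get(0).getLong(0) + 1;
                startingOffsets = "{\"test-reader\":{\"0\":" + next + "}}";
            }

            Thread.sleep(5000); // wait before polling for new records
        }
    }
}

Note that the Spark Kafka source manages offsets itself and does not commit them back to Kafka, so enable.auto.commit and the group id options in the snippets above do not give you resume behaviour on their own; the position has to be carried explicitly in startingOffsets as sketched here.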