
Java Kafka Structured Streaming

I have to perform batch queries (basically in a loop) from Kafka via Spark, each time starting from the last offset read in the previous iteration, so that I only read new data.

Dataset<Row> df = spark
                .read()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "test-reader")
                .option("enable.auto.commit", true)
                .option("kafka.group.id", "demo-reader") // not sure about the one to use
                .option("group.id", "demo-reader")
                .option("startingOffsets", "latest")
                .load();

It seems that latest is not supported in batch queries. I'm wondering if it is possible to do something similar in another way (without dealing directly with offsets).

EDIT: earliest seems to retrieve all the data contained in the topic.

Can you try earliest instead of latest for startingOffsets, as shown in the example below:

Dataset<Row> df = spark
  .read()
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test-reader")
  // "enable.auto.commit" and "group.id" (without the "kafka." prefix) are not
  // options of Spark's Kafka source; the source never commits offsets itself.
  .option("enable.auto.commit", true)
  .option("kafka.group.id", "demo-reader") // supported since Spark 3.0
  .option("group.id", "demo-reader")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load();
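Note that endingOffsets applies only to batch queries (for a streaming query it must be omitted), and with earliest/latest this reads everything currently in the topic on every run.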

Please refer to the Spark Structured Streaming + Kafka Integration Guide (https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html).

Per the documentation, the default for startingOffsets is "latest" for streaming queries and "earliest" for batch queries; for batch queries, "latest" is not allowed.
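Since a batch (read()) query keeps no state between runs, one way to read only new data in a loop is to track offsets yourself and pass them as per-partition JSON via startingOffsets, which the Kafka source documents in the form {"topic":{"partition":offset}}. Below is a minimal sketch of that idea, not a definitive implementation: it reuses the broker and topic from the question, the loop bound and sleep interval are arbitrary, and it assumes every partition receives data each iteration (a robust version would carry forward the previous offset for partitions that returned no rows):

import static org.apache.spark.sql.functions.max;

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaBatchLoop {
    public static void main(String[] args) throws InterruptedException {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-batch-loop")
                .master("local[*]")
                .getOrCreate();

        // First pass reads the topic from the beginning.
        String startingOffsets = "earliest";

        for (int i = 0; i < 10; i++) {
            Dataset<Row> df = spark
                    .read()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "localhost:9092")
                    .option("subscribe", "test-reader")
                    .option("startingOffsets", startingOffsets)
                    .option("endingOffsets", "latest")
                    .load();

            // ... process df here ...

            // Highest offset seen per partition; the next batch starts one past
            // it, since the start offset in the JSON is inclusive.
            List<Row> maxima = df
                    .groupBy("partition")
                    .agg(max("offset").alias("maxOffset"))
                    .collectAsList();

            if (!maxima.isEmpty()) {
                StringBuilder json = new StringBuilder("{\"test-reader\":{");
                for (int p = 0; p < maxima.size(); p++) {
                    if (p > 0) json.append(",");
                    json.append("\"").append(maxima.get(p).getInt(0)).append("\":")
                        .append(maxima.get(p).getLong(1) + 1);
                }
                startingOffsets = json.append("}}").toString();
            }

            Thread.sleep(5000); // arbitrary poll interval for the sketch
        }
        spark.stop();
    }
}

If tracking offsets by hand is undesirable, another documented route is a streaming query with Trigger.Once and a checkpointLocation: the checkpoint persists the last consumed offsets, so each run picks up exactly where the previous one stopped.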
