Read Kafka topic tail in Spark

I need to subscribe to a Kafka topic at its latest offset, read a handful of the newest records, print them, and finish. How can I do this in Spark? I suppose I could do something like this:

sqlContext
    .read
    .format("kafka")
    .option("kafka.bootstrap.servers", "192.168.1.1:9092,...")
    .option("subscribe", "myTopic")
    .option("startingOffsets", "latest")
    .load()
    .filter($"someField" === "someValue")
    .limit(10)
    .show()

You need to know in advance up to which offsets in which partitions you want to consume from Kafka. If you have that information, you can do something like this:

// Subscribe to a single topic, specifying explicit Kafka offsets per partition
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "192.168.1.1:9092,...")
  .option("subscribe", "myTopic")
  .option("startingOffsets", """{"myTopic":{"0":20,"1":20}}""")
  .option("endingOffsets", """{"myTopic":{"0":25,"1":25}}""")
  .load()

import spark.implicits._  // needed for .as[(String, String)]

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
  .filter(...)
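
If the "tail" should always mean the last N records per partition, you can look the offsets up with the plain Kafka consumer API right before running the batch read. Below is a minimal sketch, assuming the kafka-clients library is on the classpath; tailOffsets is a hypothetical helper name, and the broker address and topic are the placeholders from the question:

import java.util.Properties
import scala.collection.JavaConverters._  // scala.jdk.CollectionConverters._ on Scala 2.13+
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Hypothetical helper: build the startingOffsets/endingOffsets JSON strings
// covering the last `n` records of every partition of `topic`.
def tailOffsets(bootstrap: String, topic: String, n: Long): (String, String) = {
  val props = new Properties()
  props.put("bootstrap.servers", bootstrap)
  props.put("key.deserializer",
    "org.apache.kafka.common.serialization.ByteArrayDeserializer")
  props.put("value.deserializer",
    "org.apache.kafka.common.serialization.ByteArrayDeserializer")
  val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
  try {
    val partitions = consumer.partitionsFor(topic).asScala
      .map(p => new TopicPartition(topic, p.partition))
    // endOffsets returns, per partition, the offset of the next record to be written
    val end = consumer.endOffsets(partitions.asJava).asScala
    // Clamped at 0; this does not account for records already expired by retention
    val startJson = end
      .map { case (tp, off) => s""""${tp.partition}":${math.max(0L, off - n)}""" }
      .mkString(",")
    val endJson = end
      .map { case (tp, off) => s""""${tp.partition}":$off""" }
      .mkString(",")
    (s"""{"$topic":{$startJson}}""", s"""{"$topic":{$endJson}}""")
  } finally consumer.close()
}

// Usage: batch-read (at most) the last 10 records of each partition
val (start, end) = tailOffsets("192.168.1.1:9092", "myTopic", 10)
val tail = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "192.168.1.1:9092")
  .option("subscribe", "myTopic")
  .option("startingOffsets", start)
  .option("endingOffsets", end)
  .load()

Note that the end offsets keep moving while producers are writing, so this captures the tail only as of the moment of the lookup.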

More details on the startingOffsets and endingOffsets options are given in the Structured Streaming + Kafka Integration Guide.
