
Read Kafka topic tail in Spark

I need to subscribe to a Kafka topic at the latest offset, read a few of the newest records, print them, and finish. How can I do this in Spark? I suppose I could do something like this:

sqlContext
    .read
    .format("kafka")
    .option("kafka.bootstrap.servers", "192.168.1.1:9092,...")
    .option("subscribe", "myTopic")
    .option("startingOffsets", "latest")
    .load()  // load() is required before DataFrame operations
    .filter($"someField" === "someValue")
    .limit(10)
    .show()

You need to know in advance up to which offsets, in which partitions, you want to consume from Kafka. If you have that information, you can do something like:

// Subscribe to a topic, specifying explicit Kafka offsets per partition
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "192.168.1.1:9092,...")
  .option("subscribe", "myTopic")
  .option("startingOffsets", """{"myTopic":{"0":20,"1":20}}""")
  .option("endingOffsets", """{"myTopic":{"0":25,"1":25}}""")
  .load()
import spark.implicits._  // needed for .as[(String, String)]

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
  .filter(...)

More details on the startingOffsets and endingOffsets options are given in the Kafka + Spark Integration Guide.
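The startingOffsets and endingOffsets options take a JSON string mapping each topic to a map of partition number to offset. If you discover the current per-partition offsets at runtime (for example with the kafka-clients `KafkaConsumer#endOffsets` API), you still need to render them into that JSON shape. A minimal sketch of such a helper (the object name, topic name, and offsets here are hypothetical, not from the original answer):

```scala
// Sketch: render a per-partition offset map into the JSON string format
// that Spark's Kafka source expects for startingOffsets / endingOffsets,
// e.g. {"myTopic":{"0":20,"1":20}}.
object OffsetJson {
  def offsetsJson(topic: String, offsets: Map[Int, Long]): String = {
    // Sort by partition number for a stable, readable output
    val parts = offsets.toSeq
      .sortBy { case (partition, _) => partition }
      .map { case (partition, offset) => s""""$partition":$offset""" }
      .mkString(",")
    s"""{"$topic":{$parts}}"""
  }
}

// Usage, matching the option strings in the answer above:
// OffsetJson.offsetsJson("myTopic", Map(0 -> 20, 1 -> 20))
//   produces {"myTopic":{"0":20,"1":20}}
```

To tail the topic, you would subtract the desired number of records from each partition's end offset (clamped at the earliest available offset) to build startingOffsets, and pass the end offsets themselves as endingOffsets.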
