
Is there a way to read from specific offset in a Kafka stream from a Spark streaming job?

I am trying to commit offsets from my Spark streaming job to Kafka using the following:

stream.foreachRDD(rdd -> {
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

    // some time later, after outputs have completed
    ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
});

as I found in this question:

Spark DStream from Kafka always starts at beginning

And this works fine; the offsets are being committed. However, the problem is that the commit is asynchronous, which means that even after two more offset commits have been sent down the line, Kafka may still be holding the offset from two commits earlier. If the consumer crashes at that point and I bring it back up, it starts reading messages that have already been processed.

Now, from other sources, like the comments section here:

https://dzone.com/articles/kafka-clients-at-most-once-at-least-once-exactly-o

I understood that there is no way to commit offsets synchronously from a Spark streaming job (though there is one if I use Kafka Streams). Instead, people suggest keeping the offsets in the same database where you persist the end results of your computations on the stream.
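For that at-least-once pattern, a common approach is to write the processed results and the offset ranges to the database in a single transaction, so the stored progress can never get out of sync with the stored results. Below is a minimal sketch of that idea using plain JDBC; the table names (stream_results, stream_offsets), the column names, and the connection details are hypothetical, not part of the original question:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    import org.apache.spark.streaming.kafka010.OffsetRange;

    public class TransactionalOffsetStore {
        // Hypothetical JDBC URL; replace with your own database.
        private static final String JDBC_URL = "jdbc:postgresql://localhost:5432/streams";

        // Persist one batch's results and its offsets atomically, so a crash
        // can never leave saved results without the matching offsets.
        public static void saveBatch(String resultPayload, OffsetRange[] offsetRanges) throws Exception {
            try (Connection conn = DriverManager.getConnection(JDBC_URL, "user", "password")) {
                conn.setAutoCommit(false);
                try (PreparedStatement insertResult = conn.prepareStatement(
                             "INSERT INTO stream_results (payload) VALUES (?)");
                     PreparedStatement upsertOffset = conn.prepareStatement(
                             "INSERT INTO stream_offsets (topic, kafka_partition, until_offset) "
                                     + "VALUES (?, ?, ?) ON CONFLICT (topic, kafka_partition) "
                                     + "DO UPDATE SET until_offset = EXCLUDED.until_offset")) {
                    insertResult.setString(1, resultPayload);
                    insertResult.executeUpdate();

                    for (OffsetRange range : offsetRanges) {
                        upsertOffset.setString(1, range.topic());
                        upsertOffset.setInt(2, range.partition());
                        upsertOffset.setLong(3, range.untilOffset());
                        upsertOffset.executeUpdate();
                    }
                    conn.commit(); // both results and offsets, or neither
                } catch (Exception e) {
                    conn.rollback();
                    throw e;
                }
            }
        }
    }

Note that the ON CONFLICT upsert is PostgreSQL syntax; other databases have their own equivalents (MERGE, or INSERT ... ON DUPLICATE KEY UPDATE).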

Now, my question is this: If I DO store the currently read offset in my database, how do I start reading the stream from exactly that offset the next time?现在,我的问题是:如果我确实将当前读取的偏移量存储在我的数据库中,那么下次我如何从该偏移量开始读取流?

I researched and found the answer to my question, so I'm posting it here for anyone else who might face the same problem:

  • Make a Map object with org.apache.kafka.common.TopicPartition as the key and a Long as the value. The TopicPartition constructor takes two arguments: the topic name and the partition from which you will be reading. Each value in the Map is the long representation of the offset from which you want to start reading the stream.

    Map<TopicPartition, Long> startingOffset = new HashMap<>();
    startingOffset.put(new TopicPartition("topic_name", 0), 3332980L);

  • Read the stream contents into an appropriate JavaInputDStream, and provide the previously created Map object as an argument to the ConsumerStrategies.Subscribe() method.

    final JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams, startingOffset));
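Putting the two halves together: at startup the driver can load the last stored offsets from the database and pass them into Subscribe, so the job resumes exactly where the previous run's transaction left off. The sketch below assumes the same hypothetical stream_offsets table from the earlier example; the helper name loadStoredOffsets is made up for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.common.TopicPartition;

    public class OffsetLoader {
        // Hypothetical helper: read one starting offset per partition from the
        // stream_offsets table written by the transactional store sketched above.
        public static Map<TopicPartition, Long> loadStoredOffsets(String jdbcUrl) throws Exception {
            Map<TopicPartition, Long> startingOffset = new HashMap<>();
            try (Connection conn = DriverManager.getConnection(jdbcUrl, "user", "password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT topic, kafka_partition, until_offset FROM stream_offsets")) {
                while (rs.next()) {
                    // untilOffset is exclusive, so it is exactly the next offset to read.
                    startingOffset.put(
                            new TopicPartition(rs.getString("topic"), rs.getInt("kafka_partition")),
                            rs.getLong("until_offset"));
                }
            }
            return startingOffset;
        }
    }

If the table is empty (for example, on the very first run), the map will be empty, and partitions without an entry should fall back to the consumer's committed offsets or the auto.offset.reset setting.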
