How to set offset committed by the consumer group using Spark's Direct Stream for Kafka?
I am trying to use Spark's Direct Approach (no receivers) for Kafka. I have the following Kafka configuration map:
configMap.put("zookeeper.connect","192.168.51.98:2181");
configMap.put("group.id", UUID.randomUUID().toString());
configMap.put("auto.offset.reset","smallest");
configMap.put("auto.commit.enable","true");
configMap.put("topics","IPDR31");
configMap.put("kafka.consumer.id","kafkasparkuser");
configMap.put("bootstrap.servers","192.168.50.124:9092");
Now my objective is: if my Spark pipeline crashes and is restarted, the stream should resume from the latest offset committed by the consumer group. So, for that purpose, I want to specify the starting offsets for the consumer. I have the committed offset for each partition. How can I supply this information to the streaming function? Currently I am using:
JavaPairInputDStream<byte[], byte[]> kafkaData =
KafkaUtils.createDirectStream(js, byte[].class, byte[].class,
DefaultDecoder.class, DefaultDecoder.class,configMap,topic);
Look at the second form of createDirectStream in the Spark API docs - it allows you to pass in a Map&lt;TopicAndPartition, Long&gt;, where the Long is the starting offset.
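A rough sketch of that overload, adapted to the question's configMap and topic (the partition number and the offset value 1000L are hypothetical placeholders for whatever you persisted):

```java
import java.util.HashMap;
import java.util.Map;

import kafka.common.TopicAndPartition;
import kafka.message.MessageAndMetadata;
import kafka.serializer.DefaultDecoder;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;

// Build the starting position from your saved per-partition offsets.
Map<TopicAndPartition, Long> fromOffsets = new HashMap<>();
fromOffsets.put(new TopicAndPartition("IPDR31", 0), 1000L); // partition 0, saved offset

// The second form takes fromOffsets plus a messageHandler that maps each
// MessageAndMetadata record into whatever element type you want in the stream.
JavaInputDStream<Tuple2<byte[], byte[]>> kafkaData =
    KafkaUtils.createDirectStream(
        js, byte[].class, byte[].class,
        DefaultDecoder.class, DefaultDecoder.class,
        (Class<Tuple2<byte[], byte[]>>) (Class<?>) Tuple2.class,
        configMap, fromOffsets,
        new Function<MessageAndMetadata<byte[], byte[]>, Tuple2<byte[], byte[]>>() {
            @Override
            public Tuple2<byte[], byte[]> call(MessageAndMetadata<byte[], byte[]> mmd) {
                return new Tuple2<>(mmd.key(), mmd.message());
            }
        });
```

Note this overload returns a JavaInputDStream of your record type rather than a JavaPairInputDStream, so the rest of the pipeline needs a small adjustment.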
Note that Spark will not automatically update your offsets in Zookeeper when using a direct stream - you have to write them yourself, either to ZK or some other store. Unless you have a strict requirement for exactly-once semantics, it will be easier to use the createStream method to get back a DStream, in which case Spark will update the offsets in ZK and resume from the last stored offset in the case of failure.
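To write the offsets yourself, the usual pattern is to cast each batch's underlying RDD to HasOffsetRanges and persist the ranges before (or atomically with) processing. A sketch, where saveOffset is a hypothetical helper writing to ZK or a database:

```java
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

kafkaData.foreachRDD(rdd -> {
    // Only valid on the RDD produced directly by the Kafka input stream,
    // before any transformation (shuffles lose the offset metadata).
    OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    for (OffsetRange r : ranges) {
        // persist topic/partition/untilOffset so a restart can resume here
        saveOffset(r.topic(), r.partition(), r.untilOffset()); // hypothetical helper
    }
    // ...process the batch...
});
```

On restart, load these saved values into the Map&lt;TopicAndPartition, Long&gt; passed to createDirectStream.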
For your requirement, the correct solution is to use checkpointing. For every batch processed, checkpointing writes the metadata to the shared storage you specify (typically HDFS). It is metadata, not the real data, so there is no real performance impact.
If the Spark process crashes and is restarted, it will first read the checkpoint and resume from the offsets saved in it.
You can refer to my sample code, which uses Spark Streaming with checkpointing to write data to Elasticsearch reliably: https://github.com/atulsm/Test_Projects/blob/master/src/spark/StreamingKafkaRecoverableDirectEvent.java