
Read Kafka topic in a Spark batch job

I'm writing a Spark (v1.6.0) batch job which reads from a Kafka topic.
For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD; however, I need to set the offsets for all the partitions and also need to store them somewhere (ZK? HDFS?) to know where to start the next batch job.

What is the right approach to read from Kafka in a batch job?

I'm also thinking about writing a streaming job instead, which reads with auto.offset.reset=smallest, saves the checkpoint to HDFS, and then starts from that checkpoint in the next run.

But in this case, how can I fetch just once and stop streaming after the first batch?

createRDD is the right approach for reading a batch from Kafka.
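Below is a minimal sketch of what that can look like in Scala with Spark 1.6 and the Kafka 0.8 integration. The broker list, topic name, partitions and offsets are placeholders; in a real job the from/until offsets would be loaded from wherever you persist them between runs (ZK, HDFS, a database, ...).

    import kafka.serializer.StringDecoder
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    object KafkaBatchRead {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-read"))

        // Placeholder broker list and offsets -- load the real offsets from your own store.
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
        val offsetRanges = Array(
          OffsetRange("my-topic", 0, 0L, 1000L), // topic, partition, fromOffset, untilOffset
          OffsetRange("my-topic", 1, 0L, 1000L)
        )

        // Each OffsetRange becomes one partition of the resulting RDD of (key, value) pairs.
        val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
          sc, kafkaParams, offsetRanges)

        println(s"Read ${rdd.count()} messages")

        // After a successful run, persist the untilOffsets so the next batch starts from there.
        sc.stop()
      }
    }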

To query for info about the latest / earliest available offsets, look at the KafkaCluster.scala methods getLatestLeaderOffsets and getEarliestLeaderOffsets. That file was private, but should be public in the latest versions of Spark.
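A rough sketch of how those methods can be used, assuming a Spark version in which KafkaCluster is public (in 1.6 it is still private[spark], so you would have to copy the class into your project or put your code in the same package). The broker address and topic name are placeholders, and error handling is skipped for brevity.

    import kafka.common.TopicAndPartition
    import org.apache.spark.streaming.kafka.KafkaCluster

    val kc = new KafkaCluster(Map("metadata.broker.list" -> "broker1:9092"))

    // getPartitions / get*LeaderOffsets return an Either; .right.get ignores error handling.
    val partitions: Set[TopicAndPartition] = kc.getPartitions(Set("my-topic")).right.get
    val earliest = kc.getEarliestLeaderOffsets(partitions).right.get // Map[TopicAndPartition, LeaderOffset]
    val latest   = kc.getLatestLeaderOffsets(partitions).right.get

    // LeaderOffset.offset is the numeric offset you would plug into an OffsetRange.
    latest.foreach { case (tp, lo) => println(s"${tp.topic}/${tp.partition}: ${lo.offset}") }

The earliest offsets give you a starting point for the very first run, and the latest offsets give you the until-offsets for the current batch.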
