Read Kafka topic in a Spark batch job
I'm writing a Spark (v1.6.0) batch job which reads from a Kafka topic. For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD; however, I need to set the offsets for all the partitions and also need to store them somewhere (ZK? HDFS?) to know where to start the next batch job.
What is the right approach to read from Kafka in a batch job?
I'm also thinking about writing a streaming job instead, which reads with auto.offset.reset=smallest, saves its checkpoint to HDFS, and then starts from that checkpoint in the next run.

But in this case, how can I fetch just once and stop streaming after the first batch?
createRDD is the right approach for reading a batch from Kafka.
To query for info about the latest / earliest available offsets, look at the KafkaCluster.scala methods getLatestLeaderOffsets and getEarliestLeaderOffsets. That class was private, but should be public in the latest versions of Spark.
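A sketch of using those methods to build the OffsetRange array for createRDD, assuming a Spark version in which KafkaCluster is public; the broker and topic names are placeholders, and the Either results are unwrapped with .right.get for brevity (real code should handle the error side):

```scala
// Derive per-partition OffsetRanges from the earliest/latest offsets on the brokers.
// Assumes KafkaCluster is public in your Spark version; broker/topic are placeholders.
import kafka.common.TopicAndPartition
import org.apache.spark.streaming.kafka.{KafkaCluster, OffsetRange}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val kc = new KafkaCluster(kafkaParams)

// Each call returns Either[Err, ...]; .right.get is used here for brevity only.
val partitions: Set[TopicAndPartition] = kc.getPartitions(Set("my-topic")).right.get
val earliest = kc.getEarliestLeaderOffsets(partitions).right.get
val latest   = kc.getLatestLeaderOffsets(partitions).right.get

// Read everything currently available in each partition.
val offsetRanges: Array[OffsetRange] = partitions.toArray.map { tp =>
  OffsetRange(tp.topic, tp.partition, earliest(tp).offset, latest(tp).offset)
}
```

On the first run you would use the earliest offsets as shown (the batch equivalent of auto.offset.reset=smallest); on subsequent runs you would substitute the offsets you stored at the end of the previous batch.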