
Read Kafka topic in a Spark batch job

I'm writing a Spark (v1.6.0) batch job which reads from a Kafka topic.
For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD. However, I need to set the offsets for all the partitions, and I also need to store them somewhere (ZooKeeper? HDFS?) so I know where to start the next batch job from.

What is the right approach to read from Kafka in a batch job?

I'm also considering writing a streaming job instead, which reads with auto.offset.reset=smallest, saves its checkpoint to HDFS, and then starts from that checkpoint on the next run.

But in that case, how can I fetch just once and stop the streaming job after the first batch?

createRDD is the right approach for reading a batch from Kafka.
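For reference, a minimal sketch of what that looks like against the Spark 1.6 / Kafka 0.8 direct API. The broker list, topic name, and offset values below are placeholder assumptions, not anything from your setup:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-read"))

// Placeholder broker list; replace with your own.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

// One OffsetRange per partition: the half-open range [fromOffset, untilOffset)
// is read as a single batch.
val offsetRanges = Array(
  OffsetRange("my-topic", partition = 0, fromOffset = 0L, untilOffset = 1000L),
  OffsetRange("my-topic", partition = 1, fromOffset = 0L, untilOffset = 1000L)
)

val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)

// Keys are dropped here; only message values are printed.
rdd.map(_._2).take(10).foreach(println)
```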

To query for information about the latest / earliest available offsets, look at the KafkaCluster.scala methods getLatestLeaderOffsets and getEarliestLeaderOffsets. That class used to be private, but it should be public in the latest versions of Spark.
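A hedged sketch of wiring those lookups into createRDD, assuming an existing SparkContext `sc` and the same placeholder broker list and topic as above. Note that in Spark 1.6 KafkaCluster is still package-private, so this only compiles against versions where it has been made public (or from code placed in the org.apache.spark package):

```scala
import kafka.common.TopicAndPartition
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaCluster, KafkaUtils, OffsetRange}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val kc = new KafkaCluster(kafkaParams)

// Discover the topic's partitions, then ask their leaders for the earliest
// and latest available offsets. Each call returns Either[Err, ...].
val partitions: Set[TopicAndPartition] = kc.getPartitions(Set("my-topic")).right.get
val earliest = kc.getEarliestLeaderOffsets(partitions).right.get
val latest   = kc.getLatestLeaderOffsets(partitions).right.get

// Read everything currently available: [earliest, latest) per partition.
val offsetRanges = partitions.toArray.map { tp =>
  OffsetRange(tp.topic, tp.partition, earliest(tp).offset, latest(tp).offset)
}

val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)

// Persist each partition's untilOffset (e.g. to ZooKeeper or HDFS) so the
// next batch run can use it as its fromOffset.
offsetRanges.foreach(r => println(s"${r.topic}/${r.partition} -> ${r.untilOffset}"))
```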
