
How to load all records from a Kafka topic using Spark in batch mode

I want to load all records from a Kafka topic using Spark, but all the examples I have seen use Spark Streaming. How can I load messages from Kafka exactly once?

The exact steps are listed in the official documentation, for example:

// `spark` is an existing SparkSession (as in spark-shell)
val df = spark
  .read                            // batch read, as opposed to readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")      // every topic matching the regex
  .option("startingOffsets", "earliest")      // from the beginning of each partition
  .option("endingOffsets", "latest")          // up to the offsets resolved when the query runs
  .load()
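
Note that the Kafka source exposes key and value as binary columns, so you will usually cast them before use. A minimal follow-up sketch (the cast to STRING assumes your payloads are UTF-8 text):

// Kafka delivers key/value as binary; cast them to strings (assumes UTF-8 payloads)
val records = df.selectExpr(
  "CAST(key AS STRING)",
  "CAST(value AS STRING)",
  "topic", "partition", "offset")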

However "all records" is rather poorly defined if the source is continuous stream, as the result depends on the point in time, when query is executed.
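
Instead of earliest/latest you can pass explicit per-partition offsets as JSON, so the same bounded range can be re-read deterministically. A hedged sketch (the topic name, partition ids, and offset values are made up for illustration; in this JSON format -2 means earliest and -1 means latest):

// Pin explicit offsets so the batch has well-defined, repeatable bounds
// (use "subscribe" with concrete topic names rather than a pattern)
val pinned = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("startingOffsets", """{"topic1":{"0":0,"1":0}}""")
  .option("endingOffsets", """{"topic1":{"0":1500,"1":1500}}""")
  .load()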

Additionally, keep in mind that parallelism is limited by the number of partitions of the Kafka topic: each topic partition maps to a single Spark task, so be careful not to overwhelm the cluster when pulling an entire topic in one batch. The sketch below shows two ways to raise the read parallelism.
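
A sketch of both options (the value 64 is an arbitrary example; tune it to your cluster; minPartitions is available on the Kafka source from Spark 2.4):

// Option 1: ask the Kafka source to split large offset ranges into more tasks
val wide = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("minPartitions", "64")   // arbitrary example value
  .load()

// Option 2: repartition after loading (incurs a shuffle)
val reshuffled = df.repartition(64)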
