
How to load all records from a Kafka topic using Spark in batch mode

I want to load all records from a Kafka topic using Spark, but all the examples I have seen use Spark Streaming. How can I load messages from Kafka exactly once?

The exact steps are listed in the official documentation, for example:

// `spark` is an existing SparkSession (as in spark-shell)
val df = spark
  .read                            // batch read, as opposed to readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")      // every topic matching the regex
  .option("startingOffsets", "earliest")      // from the beginning of each partition
  .option("endingOffsets", "latest")          // up to the offsets resolved when the query runs
  .load()
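
Note that the Kafka source exposes key and value as binary columns, so you will usually cast them before use. A minimal follow-up sketch (the cast to STRING assumes your payloads are UTF-8 text):

// Kafka delivers key/value as binary; cast them to strings (assumes UTF-8 payloads)
val records = df.selectExpr(
  "CAST(key AS STRING)",
  "CAST(value AS STRING)",
  "topic", "partition", "offset")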

However "all records" is rather poorly defined if the source is continuous stream, as the result depends on the point in time, when query is executed.
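
Instead of earliest/latest you can pass explicit per-partition offsets as JSON, so the same bounded range can be re-read deterministically. A hedged sketch (the topic name, partition ids, and offset values are made up for illustration; in this JSON format -2 means earliest and -1 means latest):

// Pin explicit offsets so the batch has well-defined, repeatable bounds
// (use "subscribe" with concrete topic names rather than a pattern)
val pinned = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("startingOffsets", """{"topic1":{"0":0,"1":0}}""")
  .option("endingOffsets", """{"topic1":{"0":1500,"1":1500}}""")
  .load()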

Additionally, keep in mind that parallelism is limited by the number of partitions of the Kafka topic: each topic partition maps to a single Spark task, so be careful not to overwhelm the cluster when pulling an entire topic in one batch. The sketch below shows two ways to raise the read parallelism.
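
A sketch of both options (the value 64 is an arbitrary example; tune it to your cluster; minPartitions is available on the Kafka source from Spark 2.4):

// Option 1: ask the Kafka source to split large offset ranges into more tasks
val wide = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("minPartitions", "64")   // arbitrary example value
  .load()

// Option 2: repartition after loading (incurs a shuffle)
val reshuffled = df.repartition(64)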
