
What is the best way to consume the same topic from many different kafka brokers with spark structured streaming?

My load is distributed across several data centers (DCs). Each data center has its own Kafka broker and its own data processors, which handle only that data center's data. So I'll have the brokers broker1-dc1, broker1-dc2, ..., broker1-dcn, and all of them will have the same topics, e.g. DATA_TOPIC.

I want to consume the topic DATA_TOPIC from all of these brokers and persist the data in a single data lake table. I am doing this with Structured Streaming, but that isn't a requirement.

I don't have much experience with Spark, and I'd like to know the best way to do this. I'm considering two options:

  1. Run separate Spark jobs, each consuming from a different data center and each with its own checkpoint location;
  2. Run a single job that has one consumer (Kafka readStream) per data center and unions all of them.

Which of these options is better, or is there an even better option?

I don't know if this helps, but I'm planning to use an AWS architecture with EMR, S3, and Glue, with Delta Lake or Iceberg as the table format.

Thanks

A Kafka client can only use one bootstrap.servers setting at a time, so if the plan is to define N streaming DataFrames in a single job, that seems like a poor design choice, since one failing stream ideally shouldn't stop your whole application.
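For reference, option 2 would look roughly like the sketch below: one Kafka source per data center, unioned into a single streaming DataFrame. The bootstrap addresses, the `source_dc` column, and the helper names are assumptions made up for illustration; `pyspark` is imported lazily so the option-building helper can be checked without a Spark installation.

```python
from functools import reduce

# Hypothetical bootstrap addresses, one per data center; replace with real ones.
BOOTSTRAP_BY_DC = {
    "dc1": "broker1-dc1:9092",
    "dc2": "broker1-dc2:9092",
}


def kafka_source_options(bootstrap_servers, topic="DATA_TOPIC"):
    """Options for one Kafka readStream; each source gets one bootstrap.servers value."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }


def union_of_dc_streams(spark, bootstrap_by_dc):
    """Build one streaming DataFrame per data center and union them."""
    from pyspark.sql.functions import lit  # lazy import; needs pyspark installed

    frames = []
    for dc, servers in bootstrap_by_dc.items():
        df = (
            spark.readStream.format("kafka")
            .options(**kafka_source_options(servers))
            .load()
            .withColumn("source_dc", lit(dc))  # tag rows with their origin (assumed column name)
        )
        frames.append(df)
    return reduce(lambda left, right: left.unionByName(right), frames)
```

The unioned DataFrame runs as a single query with a single checkpoint, so as far as I know an unreachable cluster in one data center would fail the query for all of them, which is the concern above.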

Instead, I'd suggest using MirrorMaker 2 to consolidate the topics into one Kafka cluster that you run processing against, which should have the same effect as the union.
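On the Spark side, the consolidated cluster can then be read with a single source. A minimal sketch, assuming MirrorMaker 2 runs with its default replication policy, which names mirrored topics `<source-alias>.<topic>` (for example `dc1.DATA_TOPIC`); the central cluster address and the helper names are hypothetical, and a different replication policy would need a different pattern.

```python
import re

# Hypothetical address of the consolidated cluster; not a value from the question.
CENTRAL_BOOTSTRAP = "central-kafka:9092"
TOPIC = "DATA_TOPIC"


def mirrored_topic_pattern(topic):
    """Regex matching '<alias>.<topic>', the default MirrorMaker 2 naming."""
    return r"[^.]+\." + re.escape(topic)


def read_mirrored(spark, bootstrap=CENTRAL_BOOTSTRAP, topic=TOPIC):
    """Single Kafka source on the consolidated cluster covering every mirrored topic."""
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap)
        .option("subscribePattern", mirrored_topic_pattern(topic))
        .load()
    )
```

The Kafka source exposes a `topic` column on every row, so the originating data center can still be recovered from the alias prefix if the table needs it.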

Your first option is somewhat similar, but it's a tradeoff between managing N Spark applications along with their checkpoints, or N Kafka Connect processes that serve a single purpose and can be run in one Connect cluster.
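If you do go with N Spark applications, one way to keep them manageable is to submit the same script once per data center with different arguments. The sketch below assumes a Delta table on S3 and that the chosen table format accepts concurrent appends from separate jobs; the bucket paths and function names are hypothetical, and the `delta` format requires the Delta Lake package on the cluster.

```python
# Hypothetical bucket and table paths; substitute your own S3 locations.
CHECKPOINT_ROOT = "s3://my-bucket/checkpoints/data_topic"
TABLE_PATH = "s3://my-bucket/lake/data_topic"


def checkpoint_for(dc, root=CHECKPOINT_ROOT):
    """Each data center's query gets its own checkpoint directory."""
    return f"{root}/{dc}"


def run_for_dc(spark, dc, bootstrap_servers):
    """Start one streaming query: one data center in, shared table out."""
    stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", "DATA_TOPIC")
        .load()
    )
    return (
        stream.writeStream.format("delta")  # assumes Delta Lake; swap for Iceberg if preferred
        .option("checkpointLocation", checkpoint_for(dc))
        .outputMode("append")
        .start(TABLE_PATH)
    )
```

Each application would call something like `run_for_dc(spark, "dc1", "broker1-dc1:9092").awaitTermination()`, so a failure in one data center's job leaves the others running.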


 