
How to efficiently process TBs of data in a topic with Kafka Streams

I have a simple question related to Kafka. I hope I will get some good answers here.

I have a Kafka Streams application in which I want to handle a simple scenario: maintaining a state store for storing and querying data. The topic holds TBs of data that I want to process. I want to create a state store whose key and value differ from the topic's key and value: the store key will be a part of the topic's value field, and the store value will be something else. So for this purpose I have to read data from the Kafka topic, deserialize each value, and extract the part of the data that will serve as the store key.

My questions:

1) What would be the best way to approach this task, given that the topic has TBs of data and processing every record in the topic can be costly?

2) Which topology (DSL, Processor API, or a mix of both) best suits this scenario, and why?

@Parkash based on your question, here is a rough idea that you can use. (Please edit your question or provide some examples to get a more specific answer.)

  1. Irrespective of the amount of data in your source topic, if the topic has been partitioned appropriately, you should be able to parallelize the reads. Please refer to the Streams threading model here: https://kafka.apache.org/23/documentation/streams/architecture#streams_architecture_threads (a small configuration sketch follows this list).

  2. You will need to read all the key-value pairs. I do not see a materialization option in any of the stateless operations (from your question it looks like you are trying to do only stateless operations), so I suppose you will need to use the Processor API to build your state store (see the sketch after this list).
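For point 1, here is a minimal configuration sketch, assuming a recent Kafka Streams version; the application id, bootstrap server, and thread count below are placeholders you would adapt:

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsProps {
    public static Properties buildConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "rekey-store-app");   // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        // One consumer runs per stream thread; effective parallelism is capped by
        // the number of partitions on the source topic, so size the partition count first.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 8);
        return props;
    }
}
```

You can also run multiple instances of the application with the same application id; Streams will spread the partitions across all instances and their threads.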
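For point 2, here is a rough Processor API sketch of the re-keyed state store, assuming Kafka Streams 2.7+ (the typed Processor API), string serdes, and a hypothetical extraction rule (the first comma-separated field of the value becomes the store key); the topic and store names are placeholders:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class RekeyStoreTopology {

    public static Topology build() {
        Topology topology = new Topology();
        topology.addSource("source", new StringDeserializer(), new StringDeserializer(),
                "input-topic"); // hypothetical topic name
        topology.addProcessor("rekey", RekeyProcessor::new, "source");
        // Persistent key-value store, connected to the "rekey" processor.
        topology.addStateStore(
                Stores.keyValueStoreBuilder(
                        Stores.persistentKeyValueStore("rekeyed-store"), // hypothetical store name
                        Serdes.String(), Serdes.String()),
                "rekey");
        return topology;
    }

    // Reads each record, derives the store key from the value, and writes it to the store.
    static class RekeyProcessor implements Processor<String, String, Void, Void> {
        private KeyValueStore<String, String> store;

        @Override
        public void init(ProcessorContext<Void, Void> context) {
            store = context.getStateStore("rekeyed-store");
        }

        @Override
        public void process(Record<String, String> record) {
            // Hypothetical extraction rule: the first comma-separated field of the
            // value becomes the store key; adapt this to your actual value format.
            String newKey = record.value().split(",")[0];
            store.put(newKey, record.value());
        }
    }
}
```

The store built this way can then be exposed for reads via interactive queries on the running KafkaStreams instance.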
