
How to process TBs of topic data efficiently in Kafka Streams

Original post: 2019-09-14 18:55:45 | Tags: apache-kafka, apache-kafka-streams

I have a simple question related to Kafka. I hope I will get some good answers here.

I have a Kafka Streams application in which I want to maintain a state store for storing and querying data. The topic holds TBs of data that I want to process. I want to create a state store whose key and value differ from the topic's key and value: the store key will be part of a field in the topic value, and the store value will be something else. To do this, I have to read each record from the Kafka topic, deserialize its value, and extract the part of the data that becomes the store key.

My questions:

1) What would be the best approach for this task when the topic holds TBs of data, given that processing every record in the topic can be too costly?

2) Which topology (DSL, Processor API, or a mix of both) best suits this scenario, and why?

1 Solution

@Parkash, based on your question, here is a rough idea that you can use. (Please edit your question or provide some examples to get a more specific answer.)

Irrespective of the amount of data in your source topic, if the topic has been partitioned appropriately you should be able to parallelize the reads. Please refer to the Streams threading model here: https://kafka.apache.org/23/documentation/streams/architecture#streams_architecture_threads
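To make the parallelism point concrete, here is a rough sketch in plain Java (no Kafka dependency) of how partitions bound the useful number of threads. It assumes one task per input partition and a simple round-robin spread over threads; the real Kafka Streams assignor is more sophisticated, and the class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class TaskAssignmentSketch {

    // Assigns one task per partition to threads in round-robin order.
    // Returns, for each thread, the list of partition numbers it would process.
    // This is a simplified model, not the actual Kafka Streams assignor.
    static List<List<Integer>> assign(int partitions, int threads) {
        if (partitions < 0 || threads <= 0) {
            throw new IllegalArgumentException("partitions must be >= 0 and threads > 0");
        }
        List<List<Integer>> perThread = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            perThread.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            perThread.get(p % threads).add(p);
        }
        return perThread;
    }

    public static void main(String[] args) {
        // 12 partitions over 4 threads: every thread gets 3 partitions.
        System.out.println(assign(12, 4));
        // 4 partitions over 6 threads: two threads have nothing to do.
        System.out.println(assign(4, 6));
    }
}
```

The second call shows why the partition count matters: adding threads or instances beyond the number of partitions leaves the extra ones idle, so a TB-scale topic needs enough partitions to spread the work.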
You will need to read all the key-value pairs. I do not see a materialization option on any of the stateless operations (from your question, it looks like you are only trying to do stateless operations), so I suppose you will need to use the Processor API to build your state store.
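As a rough illustration of the per-record logic such a processor would run, here is a plain Java sketch with a `HashMap` standing in for the state store. The value format (`storeKey|payload`) and all names are assumptions made for the example, since the question does not give the real record layout. In Kafka Streams itself, this logic would sit in a `Processor`'s `process` method, and the map would be a `KeyValueStore` attached to the topology.

```java
import java.util.HashMap;
import java.util.Map;

public class RekeyStoreSketch {

    // Stand-in for a Kafka Streams key-value state store.
    private final Map<String, String> store = new HashMap<>();

    // Handles one record. The topic key is ignored; the store key is taken
    // from the value, which is assumed (hypothetically) to look like
    // "storeKey|payload". Records that do not match are skipped.
    void process(String topicKey, String topicValue) {
        if (topicValue == null) {
            return;
        }
        int sep = topicValue.indexOf('|');
        if (sep <= 0) {
            return; // no separator, or empty store key
        }
        String storeKey = topicValue.substring(0, sep);
        String storeValue = topicValue.substring(sep + 1);
        store.put(storeKey, storeValue); // later records overwrite earlier ones
    }

    // Point lookup, similar in spirit to querying the store by key.
    String get(String storeKey) {
        return store.get(storeKey);
    }

    int size() {
        return store.size();
    }

    public static void main(String[] args) {
        RekeyStoreSketch sketch = new RekeyStoreSketch();
        sketch.process("k1", "user-42|first");
        sketch.process("k2", "user-42|second");
        sketch.process("k3", "malformed");
        System.out.println(sketch.get("user-42"));
        System.out.println(sketch.size());
    }
}
```

Every record still has to be read and deserialized once; the per-record work is a key extraction and a single put.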