
How to process TBs of topic data efficiently in Kafka Streams

I have a simple question about Kafka and hope to get some good answers here.

I have a Kafka Streams application in which I want to maintain a state store for storing and querying data. The source topic holds TBs of data that I want to process. The state store should use a key and value that differ from the topic's key and value: the store key will be part of a field in the topic value, and the store value will be something else. To do this, I have to read each record from the Kafka topic, deserialize its value, and extract the part of the data that will become the store key.
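To make the scenario concrete, here is a minimal, library-free Java sketch of the re-keying step. It is only a model under stated assumptions: a `HashMap` stands in for the state store, the record value is assumed to be a comma-separated string such as `order-1,customer-42,19.99`, and the names `RekeySketch`, `parse`, `extractStoreKey`, `extractStoreValue`, and `process` are hypothetical. In a real Kafka Streams application, the same per-record logic would sit inside a processor that writes to a `KeyValueStore`.

```java
import java.util.HashMap;
import java.util.Map;

// Library-free model of the scenario; not Kafka Streams code.
final class RekeySketch {
    // Stand-in for the state store: store key -> store value.
    private final Map<String, String> store = new HashMap<>();

    // Assumption: the deserialized topic value is "orderId,customerId,amount".
    private static String[] parse(String topicValue) {
        String[] fields = topicValue.split(",");
        if (fields.length < 3) {
            throw new IllegalArgumentException("Unexpected value: " + topicValue);
        }
        return fields;
    }

    // Assumption: the second field (customerId) becomes the store key.
    static String extractStoreKey(String topicValue) {
        return parse(topicValue)[1];
    }

    // Assumption: the third field (amount) becomes the store value.
    static String extractStoreValue(String topicValue) {
        return parse(topicValue)[2];
    }

    // Called once per record. The topic key is ignored because the store
    // uses a key derived from the value instead.
    void process(String topicKey, String topicValue) {
        store.put(extractStoreKey(topicValue), extractStoreValue(topicValue));
    }

    // Point lookup, analogous to querying the state store by its own key.
    String get(String storeKey) {
        return store.get(storeKey);
    }

    public static void main(String[] args) {
        RekeySketch sketch = new RekeySketch();
        sketch.process("k1", "order-1,customer-42,19.99");
        sketch.process("k2", "order-2,customer-7,5.00");
        System.out.println(sketch.get("customer-42")); // prints 19.99
    }
}
```

Note that every record still has to be read and deserialized once, because the store key lives inside the value.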

My questions:

1) What is the best way to accomplish this when the topic holds TBs of data, given that processing every record in the topic can be very costly?

2) Which approach to building the topology (DSL, Processor API, or a mix of both) best suits this scenario, and why?

@Parkash, based on your question, here is a rough idea that you can use. (Please edit your question or provide some examples to get a more specific answer.)

  1. Irrespective of the amount of data in your source topic, if the topic has been partitioned appropriately you should be able to parallelize the reads. Please refer to the Kafka Streams threading model here: https://kafka.apache.org/23/documentation/streams/architecture#streams_architecture_threads

  2. You will need to read all the key-value pairs. I do not see a materialization option in any of the stateless operations (from your question, it looks like you are trying to do only stateless operations), so I suppose you will need to use the Processor API to build your state store.
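For point 1, a rough sizing sketch may help. In Kafka Streams, a sub-topology that reads one source topic gets one task per input partition, so the number of tasks that can run in parallel is capped by the partition count no matter how many threads are configured. The snippet below uses only `java.util.Properties` with the real config keys `application.id`, `bootstrap.servers`, and `num.stream.threads`; the class name, the application id, the broker address, and the partition and instance counts are illustrative assumptions.

```java
import java.util.Properties;

// Sizing sketch; uses only the JDK, not the Kafka Streams library.
final class ParallelismSketch {
    // One task per partition, spread over (instances * threads per instance)
    // stream threads. Threads beyond the task count have nothing to run.
    static int effectiveParallelism(int partitions, int instances, int threadsPerInstance) {
        return Math.min(partitions, instances * threadsPerInstance);
    }

    // Builds the kind of configuration that is handed to a Kafka Streams application.
    static Properties streamsConfig(int threadsPerInstance) {
        Properties props = new Properties();
        props.put("application.id", "tb-topic-store-builder"); // hypothetical id
        props.put("bootstrap.servers", "localhost:9092");       // hypothetical broker
        props.put("num.stream.threads", Integer.toString(threadsPerInstance));
        return props;
    }

    public static void main(String[] args) {
        Properties props = streamsConfig(4);
        int threads = Integer.parseInt(props.getProperty("num.stream.threads"));
        // Assumed: 32 partitions, 4 instances x 4 threads -> all 16 threads busy.
        System.out.println(effectiveParallelism(32, 4, threads)); // prints 16
        // Assumed: 8 partitions -> only 8 tasks exist, so 8 threads stay idle.
        System.out.println(effectiveParallelism(8, 4, threads));  // prints 8
    }
}
```

If the second case applies, adding threads or instances will not speed up the initial pass over the topic; only a higher partition count would.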

