
Why do I have to configure a state store with Kafka Streams?

Currently I have the following setup:

// Persistent (RocksDB-backed) key-value store; the changelog topic is enabled by default
StoreBuilder<?> storeBuilder = Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("kafka.topics.table"),
        new SomeKeySerde(),
        new SomeValueSerde());

streamsBuilder.addStateStore(storeBuilder);

final KStream<byte[], SomeClass> requestsStream = streamsBuilder
        .stream("myTopic", Consumed.with(Serdes.ByteArray(), theSerde));

// Drop null values, then hand each record to a processor that is connected to the store
requestsStream
        .filter((key, request) -> Objects.nonNull(request))
        .process(() -> new SomeClassUpdater("kafka.topics.table", maxNumMatches), "kafka.topics.table");

Properties streamsConfiguration = loadConfiguration();
KafkaStreams streams = new KafkaStreams(streamsBuilder.build(), streamsConfiguration);

streams.start();

Why do I need the local state store, given that I'm not doing any other computation with it and the data is also stored in the Kafka changelog? Also, at what point is data written to the local store: is it stored locally and then committed to the changelog?

The problem I'm facing is that I'm storing data locally and, over time, I run into memory problems, especially when it repartitions often, because the old partitions still sit around and fill up memory. So my question is: why do we need persistence with RocksDB, given that:

  1. the data is persisted in the Kafka changelog
  2. the ramdisk is gone anyway when the container is gone.

A single stream thread can run multiple tasks, and the number of tasks equals the number of partitions of the input topic. Each task has its own state store for its partition, and each store writes its updates to a changelog, which is an internal Kafka topic. Kafka Streams can also keep standby replicas of a task's state store on other instances (controlled by num.standby.replicas, which is 0 by default), so that the state of a partition can be recovered quickly if its task fails.
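The settings that govern where local state lives and whether replicas are kept are ordinary Kafka Streams configuration entries. The following is a minimal sketch of what a method like loadConfiguration() might return. The application id, broker address, and state directory are placeholder values chosen for illustration, and the keys are written as plain strings so the snippet compiles without the Kafka client on the classpath.

```java
import java.util.Properties;

class StreamsConfigSketch {

    // Builds a Properties object of the kind passed to new KafkaStreams(topology, props).
    // All values below are assumptions for illustration, not taken from the question.
    static Properties build() {
        Properties props = new Properties();
        props.put("application.id", "state-store-demo");   // hypothetical application id
        props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker address
        props.put("num.stream.threads", "1");              // one thread may still run several tasks
        props.put("num.standby.replicas", "1");            // keep one warm copy of each store elsewhere
        props.put("state.dir", "/tmp/kafka-streams");      // placeholder directory for local store files
        return props;
    }

    public static void main(String[] args) {
        // Print the resulting configuration
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

With num.standby.replicas left at 0, no replica exists and a failed task can only be rebuilt from the changelog. Pointing state.dir at a ramdisk, as in the question, means the local files count against memory.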

If there is no local copy of the state and one of your tasks fails, the state has to be rebuilt by going to the internal topic, i.e. the changelog, and replaying the data for that partition, which is a time-consuming job. Hence, maintaining a state store reduces the recovery time after a task failure: the task can resume from an existing store or replica almost immediately instead of replaying the whole changelog.
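The cost difference described above can be illustrated with a toy model. This is not Kafka Streams code: the Change class, the restore method, and the use of a plain list as the changelog are all assumptions made for the sketch. It compares rebuilding a store from offset 0 with catching up a store that already holds state up to a known offset.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChangelogRestoreSketch {

    // One changelog entry: the new value for a key (a null value means "delete the key").
    static final class Change {
        final String key;
        final String value;
        Change(String key, String value) { this.key = key; this.value = value; }
    }

    // Replays changelog entries from fromOffset to the end into the store.
    // Returns the number of entries that had to be read.
    static int restore(List<Change> changelog, int fromOffset, Map<String, String> store) {
        int replayed = 0;
        for (int offset = fromOffset; offset < changelog.size(); offset++) {
            Change change = changelog.get(offset);
            if (change.value == null) {
                store.remove(change.key);
            } else {
                store.put(change.key, change.value);
            }
            replayed++;
        }
        return replayed;
    }

    public static void main(String[] args) {
        // A changelog with 1000 updates spread over 10 keys
        List<Change> changelog = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            changelog.add(new Change("k" + (i % 10), "v" + i));
        }

        // Case 1: no local state survived, so the whole changelog is replayed
        Map<String, String> cold = new HashMap<>();
        int coldReplayed = restore(changelog, 0, cold);

        // Case 2: a local store (or replica) already reflects offsets 0..989
        Map<String, String> warm = new HashMap<>();
        restore(changelog.subList(0, 990), 0, warm);
        int warmReplayed = restore(changelog, 990, warm);

        System.out.println("cold restore replayed " + coldReplayed + " entries");
        System.out.println("warm restore replayed " + warmReplayed + " entries");
        System.out.println("same final state: " + cold.equals(warm));
    }
}
```

Both paths end in the same state; the only difference is how much of the changelog has to be read first, which is the trade-off the local store is buying.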
