
Why do I have to configure a state store with Kafka Streams?

Currently I have the following setup:

StoreBuilder<?> storeBuilder = Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("kafka.topics.table"),
    new SomeKeySerde(),
    new SomeValueSerde());

// Register the store with the topology so processors can attach to it by name.
streamsBuilder.addStateStore(storeBuilder);

final KStream<byte[], SomeClass> requestsStream = streamsBuilder
    .stream("myTopic", Consumed.with(Serdes.ByteArray(), theSerde));

requestsStream
    .filter((key, request) -> Objects.nonNull(request))
    // The last argument connects the processor to the state store registered above.
    .process(() -> new SomeClassUpdater("kafka.topics.table", maxNumMatches), "kafka.topics.table");

Properties streamsConfiguration = loadConfiguration();
KafkaStreams streams = new KafkaStreams(streamsBuilder.build(), streamsConfiguration);

streams.start();

Why do I need the local state store, given that I'm not doing any other computation with it and the data is also stored in the Kafka changelog? Also, at what point is data written to the local store, and does it both store locally and commit to the changelog?

The problem I'm facing is that I'm storing data locally, and over time I run into memory problems, especially when it repartitions often, because the old partitions still sit around and fill the memory. So my question is: why do we need persistence with RocksDB, given that:

  1. the data is persisted in the Kafka changelog
  2. the ramdisk is gone anyway when the container is gone.

A single thread can run multiple tasks, up to the number of partitions of the topic. Each task has its own state store for its partition, and each state store writes its updates to a changelog, which is an internal Kafka topic. Kafka Streams can also keep a replica of a task's state store on another instance (a standby replica, enabled with num.standby.replicas, which defaults to 0), so that the state can be recovered quickly if the task that owns the partition fails.
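Whether replicas of a state store are kept, and where the local store lives on disk, is controlled by the streams configuration. The following is a minimal sketch, not the asker's actual loadConfiguration(): it uses plain string keys so that it runs without the Kafka libraries, and the application id, broker address, and directory path are hypothetical values chosen for illustration. The keys num.standby.replicas and state.dir are standard Kafka Streams setting names.

```java
import java.util.Properties;

public class StandbyConfigSketch {

    // Builds a configuration with plain string keys (no Kafka dependency needed for the sketch).
    public static Properties build() {
        Properties props = new Properties();
        props.setProperty("application.id", "state-store-demo");    // hypothetical id
        props.setProperty("bootstrap.servers", "localhost:9092");   // hypothetical broker address
        // Keep one warm copy of each state store on another instance (the default is 0).
        props.setProperty("num.standby.replicas", "1");
        // Directory for the local RocksDB files; hypothetical path, ideally a volume
        // that survives container restarts.
        props.setProperty("state.dir", "/var/lib/kafka-streams");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        System.out.println("standby replicas: " + props.getProperty("num.standby.replicas"));
        System.out.println("state dir: " + props.getProperty("state.dir"));
    }
}
```

In a real application these properties would be passed to the KafkaStreams constructor, as streamsConfiguration is in the question.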

If you don't use a local state store and one of your tasks fails, the task has to go to the internal topic, i.e. the changelog, and fetch all the data for its partition again, which is a time-consuming job. Maintaining a state store therefore reduces recovery time after a task failure: the data can be taken from an existing state store right away instead of being read back from the changelog.
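To illustrate why fetching from the changelog is the slow path, here is a simplified model of a restore, written in plain Java with no Kafka dependency. It is only a sketch under stated assumptions: the Record class and the restore method are hypothetical names, the changelog is modelled as an in-memory list, and a null value stands for a tombstone (a delete). The point is that every record has to be replayed in order before the store is usable again.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChangelogReplaySketch {

    // One changelog entry; a null value models a tombstone (delete).
    public static final class Record {
        final String key;
        final String value;

        public Record(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Rebuilds the store contents by replaying every record in order.
    // The cost grows with the length of the changelog, not with the final store size.
    public static Map<String, String> restore(List<Record> changelog) {
        Map<String, String> store = new HashMap<>();
        for (Record record : changelog) {
            if (record.value == null) {
                store.remove(record.key);   // tombstone: drop the key
            } else {
                store.put(record.key, record.value);   // later writes overwrite earlier ones
            }
        }
        return store;
    }

    public static void main(String[] args) {
        List<Record> changelog = new ArrayList<>();
        changelog.add(new Record("a", "1"));
        changelog.add(new Record("b", "2"));
        changelog.add(new Record("a", "3"));
        changelog.add(new Record("b", null));
        System.out.println(restore(changelog)); // prints {a=3}
    }
}
```

Four records are replayed here to end up with a single entry. A local store that survived the failure already holds that final state, so little or nothing needs to be replayed.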

