Accumulate Events from Multiple Topics using Kafka Streams

I do apologise if this is a dumb question.

I have a scenario whereby I consume 3 topics from an upstream service (the messages are not keyed). Unfortunately, I can't change the behaviour of the 3 topics.

The upstream service bulk publishes all the messages at the end of the day, and I need to get an accumulated view of the transactions, since the order of the transactions matters for a downstream service.

I understand I can't re-order the messages across the different partitions of the topics, so I figured that if I could accumulate them, my service could then take the accumulated result and re-order the transactions before processing.
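For the re-ordering step, something like the sketch below is what I have in mind, assuming Transaction exposes a timestamp to sort on (getTimestamp and process are illustrative names, not from my actual code):

    // Downstream consumer side: take the accumulated result and re-order
    // the transactions before handing them to the processing logic.
    List<Transaction> ordered = new ArrayList<>(accumulator.getTransactions());
    ordered.sort(Comparator.comparing(Transaction::getTimestamp)); // assumed timestamp field
    ordered.forEach(this::process); // hypothetical processing hook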

However, I am noticing a weird behaviour, and I am hoping someone can clarify what I am missing.

When I run the operation with 1 to 500 accounts, I see the expected 500 accumulated messages on the output topic.

However, when I try the same operation with 10,000 accounts, I see more output than there should be (13,000 messages on the output topic).

    KStream<String, TransactionAccumulator> transactions =
        disbursements
            .merge(repayments)
            .merge(fees)
            .groupBy(
                (key, value) -> value.getAccountId(),
                Grouped.with(
                    Serdes.String(),
                    Serdes.serdeFrom(
                        new JsonSerializer<>(mapper),
                        new JsonDeserializer<>(Transaction.class, mapper))))
            .windowedBy(SessionWindows.with(Duration.ofMinutes(1)))
            .aggregate(
                TransactionAccumulator::new,
                (key, value, aggregate) -> aggregate.add(value),
                (aggKey, aggOne, aggTwo) -> aggOne.merge(aggTwo),
                Materialized.with(
                    Serdes.String(),
                    Serdes.serdeFrom(
                        new JsonSerializer<>(mapper),
                        new JsonDeserializer<>(TransactionAccumulator.class, mapper))))
            .toStream((key, value) -> key.key());

As stated earlier, the upstream service bulk publishes all the events at the end of the day (instead of real-time).

I would appreciate any pointers on what I am missing here, since it seems to work for smaller volumes.


Update 1

I tried the suggestion of using suppression so that only the final result of each window is sent.

However, when using this, it basically does not publish any messages to the output topic, though I can see that there are messages in the "KTABLE-SUPPRESS-STATE-STORE".

The updated code with the suppress is as follows.

   disbursements
        .merge(repayments)
        .merge(fees)
        .groupBy(
            (key, value) -> value.getAccountId(),
            Grouped.with(
                Serdes.String(),
                Serdes.serdeFrom(
                    new JsonSerializer<>(mapper),
                    new JsonDeserializer<>(Transaction.class, mapper))))
        .windowedBy(TimeWindows.of(Duration.ofMinutes(1)).grace(Duration.ofMinutes(1)))
        .aggregate(
            TransactionAccumulator::new,
            (key, value, aggregate) -> aggregate.add(value),
            Materialized.with(
                Serdes.String(),
                Serdes.serdeFrom(
                    new JsonSerializer<>(mapper),
                    new JsonDeserializer<>(TransactionAccumulator.class, mapper))))
        .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
        .mapValues(
            value -> {
              LOGGER.info(
                  "Sending {} Transactions for {}",
                  value.getTransactions().size(),
                  value.getAccountId());
              return value;
            })
        .toStream((key, value) -> key.key());

I also do not see the log messages I added in the mapValues step. For clarity, I am using Spring Cloud Stream in this experiment, and the final log entries I see from the stream app are as follows.

INFO 23436 --- [-StreamThread-1] org.apache.kafka.streams.KafkaStreams    : stream-client [StreamConsumer-consume-applicationId-de25a238-5f0f-4d84-9bd2-3e7b01b7f0b3] State transition from REBALANCING to RUNNING
INFO 23436 --- [-StreamThread-1] o.a.k.s.s.i.RocksDBTimestampedStore      : Opening store KSTREAM-AGGREGATE-STATE-STORE-0000000006.1583625600000 in regular mode
(the same line repeats several more times)

Sorry, I can't comment yet, but here are my two cents:

  1. KGroupedStream.aggregate(): Kafka Streams uses a record cache to control the rate at which aggregation updates are emitted from the materialized view (or KTable) of the aggregate to the state store and downstream processors. For example, with the messages:
("word1", 4)
("word1", 2)
("word2", 3)
("word1", 1)

and a word-count topology like this:

wordCntPerSentenceKStream
    .groupByKey()
    .aggregate(
        () -> 0,
        (word, newWordCnt, aggsWordCnt) -> aggsWordCnt + newWordCnt,
        Materialized.<String, Integer, KeyValueStore<Bytes, byte[]>>as("word-cnt-store")
            .withValueSerde(Serdes.Integer()))
    .toStream();

you might receive downstream messages like these:

("word1", 6)
("word2", 3)
("word1", 7)

So my guess is that your input topics contain multiple transactions for a single AccountId, and the record cache gets flushed whenever the cache (cache.max.bytes.buffering) fills up or the commit interval (commit.interval.ms) elapses, which is why you see intermediate aggregation results downstream.
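If that is the case, enlarging the record cache and the commit interval should reduce (though not eliminate) the intermediate updates. A minimal sketch of the relevant settings, with illustrative values, assuming the usual java.util.Properties and org.apache.kafka.streams.StreamsConfig imports (with Spring Cloud Stream these would go into the binder configuration instead):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "transaction-accumulator");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Larger record cache -> fewer intermediate aggregation results flushed downstream.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 50 * 1024 * 1024L);
// Longer commit interval -> the cache is flushed less frequently.
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30_000L);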

  2. If your sink is idempotent, you can simply overwrite the TransactionAccumulator for a given message key, or you can use KTable.suppress() as stated here to emit only the final message of each aggregation window.
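For the suppress() route, here is a minimal sketch on the word-count example above (the windowing parameters are illustrative):

wordCntPerSentenceKStream
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)).grace(Duration.ofSeconds(30)))
    .aggregate(
        () -> 0,
        (word, newWordCnt, aggsWordCnt) -> aggsWordCnt + newWordCnt,
        Materialized.<String, Integer, WindowStore<Bytes, byte[]>>as("word-cnt-store")
            .withValueSerde(Serdes.Integer()))
    // Nothing is emitted until stream time passes window end + grace;
    // then exactly one final result per key and window is forwarded.
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream();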
