
Accumulate Events from Multiple Topics using Kafka Streams

I do apologise if this is a dumb question.

I have a scenario where I have three topics from an upstream service (the messages are not keyed). Unfortunately, I can't change the behaviour of those three topics.

The upstream service bulk-publishes all the messages at the end of the day, and I need an accumulated view of the transactions, since the order of the transactions matters for a downstream service.

I understand I can't re-order the messages across the different partitions of the topics, so I figured that if I could accumulate them, my service could then take the accumulated result and re-order it before processing.

However, I am noticing a weird behaviour, and I am hoping someone can clarify what I am missing.

When I do the operation with 1 to 500 accounts, I see 500 messages accumulated and displayed in the output topic.

However, when I try the same operation with 10,000 accounts, I see more output than there should be (13,000 messages on the output topic).

    // Merge the three transaction streams, re-key by accountId, and accumulate
    // per account within a session window.
    KStream<String, TransactionAccumulator> transactions =
        disbursements
            .merge(repayments)
            .merge(fees)
            .groupBy(
                (k, v) -> v.getAccountId(),
                Grouped.with(
                    Serdes.String(),
                    Serdes.serdeFrom(
                        new JsonSerializer<>(mapper),
                        new JsonDeserializer<>(Transaction.class, mapper))))
            .windowedBy(SessionWindows.with(Duration.of(1, ChronoUnit.MINUTES)))
            .aggregate(
                TransactionAccumulator::new,
                (key, value, aggregate) -> aggregate.add(value),
                (aggKey, aggOne, aggTwo) -> aggOne.merge(aggTwo),
                Materialized.with(
                    Serdes.String(),
                    Serdes.serdeFrom(
                        new JsonSerializer<>(mapper),
                        new JsonDeserializer<>(TransactionAccumulator.class, mapper))))
            .toStream((key, value) -> key.key());

As stated earlier, the upstream service bulk-publishes all the events at the end of the day (instead of in real time).

I would appreciate any pointers on what I am missing here, since it seems to work for smaller volumes.


Update 1

I tried the suggestion of using suppression to emit only the final window.

However, when using this, it basically does not publish any messages to the output topic, though I do see that there are messages in the "KTABLE-SUPPRESS-STATE-STORE".

The updated code with the suppress is as follows.

   disbursements
        .merge(repayments)
        .merge(fees)
        .groupBy(
            (key, value) -> value.getAccountId(),
            Grouped.with(
                Serdes.String(),
                Serdes.serdeFrom(
                    new JsonSerializer<>(mapper),
                    new JsonDeserializer<>(Transaction.class, mapper))))
        .windowedBy(TimeWindows.of(Duration.ofMinutes(1)).grace(Duration.ofMinutes(1)))
        .aggregate(
            TransactionAccumulator::new,
            (key, value, aggregate) -> aggregate.add(value),
            Materialized.with(
                Serdes.String(),
                Serdes.serdeFrom(
                    new JsonSerializer<>(mapper),
                    new JsonDeserializer<>(TransactionAccumulator.class, mapper))))
        .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))
        .mapValues(
            value -> {
              LOGGER.info(
                  "Sending {} Transactions for {}",
                  value.getTransactions().size(),
                  value.getAccountId());
              return value;
            })
        .toStream((key, value) -> key.key());

I also do not see the log messages I introduced. For clarity, I am using Spring Cloud Stream in this experiment, and the final log entries I see from the stream app are as follows.

INFO 23436 --- [-StreamThread-1] org.apache.kafka.streams.KafkaStreams    : stream-client [StreamConsumer-consume-applicationId-de25a238-5f0f-4d84-9bd2-3e7b01b7f0b3] State transition from REBALANCING to RUNNING
INFO 23436 --- [-StreamThread-1] o.a.k.s.s.i.RocksDBTimestampedStore      : Opening store KSTREAM-AGGREGATE-STATE-STORE-0000000006.1583625600000 in regular mode
INFO 23436 --- [-StreamThread-1] o.a.k.s.s.i.RocksDBTimestampedStore      : Opening store KSTREAM-AGGREGATE-STATE-STORE-0000000006.1583625600000 in regular mode
INFO 23436 --- [-StreamThread-1] o.a.k.s.s.i.RocksDBTimestampedStore      : Opening store KSTREAM-AGGREGATE-STATE-STORE-0000000006.1583625600000 in regular mode
INFO 23436 --- [-StreamThread-1] o.a.k.s.s.i.RocksDBTimestampedStore      : Opening store KSTREAM-AGGREGATE-STATE-STORE-0000000006.1583625600000 in regular mode
INFO 23436 --- [-StreamThread-1] o.a.k.s.s.i.RocksDBTimestampedStore      : Opening store KSTREAM-AGGREGATE-STATE-STORE-0000000006.1583625600000 in regular mode
INFO 23436 --- [-StreamThread-1] o.a.k.s.s.i.RocksDBTimestampedStore      : Opening store KSTREAM-AGGREGATE-STATE-STORE-0000000006.1583625600000 in regular mode
INFO 23436 --- [-StreamThread-1] o.a.k.s.s.i.RocksDBTimestampedStore      : Opening store KSTREAM-AGGREGATE-STATE-STORE-0000000006.1583625600000 in regular mode
INFO 23436 --- [-StreamThread-1] o.a.k.s.s.i.RocksDBTimestampedStore      : Opening store KSTREAM-AGGREGATE-STATE-STORE-0000000006.1583625600000 in regular mode
INFO 23436 --- [-StreamThread-1] o.a.k.s.s.i.RocksDBTimestampedStore      : Opening store KSTREAM-AGGREGATE-STATE-STORE-0000000006.1583625600000 in regular mode
INFO 23436 --- [-StreamThread-1] o.a.k.s.s.i.RocksDBTimestampedStore      : Opening store KSTREAM-AGGREGATE-STATE-STORE-0000000006.1583625600000 in regular mode
INFO 23436 --- [-StreamThread-1] o.a.k.s.s.i.RocksDBTimestampedStore      : Opening store KSTREAM-AGGREGATE-STATE-STORE-0000000006.1583625600000 in regular mode
INFO 23436 --- [-StreamThread-1] o.a.k.s.s.i.RocksDBTimestampedStore      : Opening store KSTREAM-AGGREGATE-STATE-STORE-0000000006.1583625600000 in regular mode

Sorry, I can't comment yet, but here are my two cents:

  1. KGroupedStream.aggregate(): Kafka Streams uses a record cache to control the rate at which aggregation updates are emitted from the materialized view (KTable) of the aggregate to the state store and downstream processors. For example, with the messages:
("word1", 4)
("word1", 2)
("word2", 3)
("word1", 1)

And your word count topology:

wordCntPerSentenceKStream
    .groupByKey()
    .aggregate(
        () -> 0,
        (word, newWordCnt, aggWordCnt) -> aggWordCnt + newWordCnt,
        Materialized.as("word-cnt-store").withValueSerde(Serdes.Integer()))
    .toStream();

you may receive downstream messages like these:

("word1", 6)
("word2", 3)
("word1", 7)

So my guess is that your input topics may contain multiple transactions for a single accountId, and the record cache gets flushed whenever the cache (cache.max.bytes.buffering) fills up or commit.interval.ms elapses (see the configuration sketch after this list).

  2. If your sink is idempotent, you can just override your TransactionAccumulator with the new message key, or you can use KTable.suppress(), as stated here, to emit only the final message of the aggregation window (see the sketch below).
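
Regarding point 1, below is a minimal sketch (not from the original post) of the two Kafka Streams settings mentioned above that govern when cached aggregation results are flushed downstream; the values shown are illustrative assumptions, not recommendations.

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    Properties props = new Properties();
    // A larger record cache means fewer intermediate aggregation updates are
    // forwarded to the state store and downstream processors.
    props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L); // 10 MB (illustrative)
    // The cache is also flushed on every commit, so a longer commit interval
    // further reduces how often intermediate results are emitted.
    props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30_000L); // 30 seconds (illustrative)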
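
For the idempotent-sink option in point 2, here is a minimal sketch under the assumption that the downstream sink upserts by key, so every intermediate TransactionAccumulator update can be emitted and the latest record per accountId simply overwrites the previous one. The topic name "accumulated-transactions" and the transactionSerde/accumulatorSerde variables are hypothetical placeholders, not names from the original post.

    disbursements
        .merge(repayments)
        .merge(fees)
        .groupBy(
            (key, value) -> value.getAccountId(),
            Grouped.with(Serdes.String(), transactionSerde))
        // No windowing: the accumulator per accountId is updated as records arrive,
        // and each update is forwarded; an upserting sink keeps only the latest one.
        .aggregate(
            TransactionAccumulator::new,
            (key, value, aggregate) -> aggregate.add(value),
            Materialized.with(Serdes.String(), accumulatorSerde))
        .toStream()
        .to("accumulated-transactions", Produced.with(Serdes.String(), accumulatorSerde));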
