
Kafka Streams - updating aggregations on KTable

I have a KTable with data that looks like this (key => value), where keys are customer IDs, and values are small JSON objects containing some customer data:

1 => { "name" : "John", "age_group":  "25-30"}
2 => { "name" : "Alice", "age_group": "18-24"}
3 => { "name" : "Susie", "age_group": "18-24" }
4 => { "name" : "Jerry", "age_group": "18-24" }

I'd like to do some aggregations on this KTable, and basically keep a count of the number of records for each age_group. The desired KTable data would look like this:

"18-24" => 3
"25-30" => 1

Let's say Alice, who is in the 18-24 group above, has a birthday that puts her in a new age group. The state store backing the first KTable should now look like this:

1 => { "name" : "John", "age_group":  "25-30"}
2 => { "name" : "Alice", "age_group": "25-30"} # Happy Cake Day
3 => { "name" : "Susie", "age_group": "18-24" }
4 => { "name" : "Jerry", "age_group": "18-24" }

And I'd like the resulting aggregated KTable to reflect this, e.g.

"18-24" => 2
"25-30" => 2

I may be overgeneralizing the issue described here:

"In Kafka Streams there is no such thing as a final aggregation... Depending on your use case, manual de-duplication would be a way to resolve the issue."

But I have only been able to calculate a running total so far, e.g. Alice's birthday would be interpreted as:

"18-24" => 3 # Old Alice record still gets counted here
"25-30" => 2 # New Alice record gets counted here as well

Edit: here is some additional behavior I noticed that seems unexpected.

The topology I'm using looks like this:

dataKTable = builder.table("compacted-topic-1", "users-json")
    .groupBy((key, value) -> KeyValue.pair(getAgeRange(value), key))
    .count("age-range-counts")

1) Empty State

Now, from the initial, empty state, everything looks like this:

compacted-topic-1
(empty)


dataKTable
(empty)


// groupBy()
Repartition topic: $APP_ID-age-range-counts-repartition
(empty)

// count()
age-range-counts state store
(empty)

2) Send a couple of messages

Now, let's send a couple of messages to compacted-topic-1, which is streamed as the KTable above. Here is what happens:

compacted-topic-1
3 => { "name" : "Susie", "age_group": "18-24" }
4 => { "name" : "Jerry", "age_group": "18-24" }

dataKTable
3 => { "name" : "Susie", "age_group": "18-24" }
4 => { "name" : "Jerry", "age_group": "18-24" }


// groupBy()
// why does this generate 4 events???
Repartition topic: $APP_ID-age-range-counts-repartition
18-24 => 3
18-24 => 3
18-24 => 4
18-24 => 4

// count()
age-range-counts state store
18-24 => 0

So I'm wondering:

  • Is what I'm trying to do even possible using Kafka Streams 0.10.1 or 0.10.2? I've tried using groupBy and count in the DSL, but maybe I need to use something like reduce?
  • Also, I'm having a little trouble understanding the circumstances that lead to the adder and the subtractor being called, so any clarification around these points would be greatly appreciated.

If you have your original KTable containing id -> Json data (let's call it dataKTable), you should be able to get what you want via:

KTable<String, Long> countKTablePerRange
    = dataKTable.groupBy(/* map your age-range to be the key */)
                .count("someStoreName");

This should work for all versions of Kafka's Streams API.
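
For reference, here is a fuller sketch of that approach against the 0.10.1/0.10.2 API. The application id, the output topic, and the extractAgeRange() helper are illustrative assumptions rather than part of the original question:

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class AgeRangeCountExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "age-range-counter"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        KStreamBuilder builder = new KStreamBuilder();

        // Read the compacted topic as a changelog-backed KTable (id -> JSON string).
        KTable<String, String> dataKTable = builder.table("compacted-topic-1", "users-json");

        // Re-key each user record by its age range, then count records per age range.
        KTable<String, Long> countKTablePerRange = dataKTable
                .groupBy((userId, json) -> KeyValue.pair(extractAgeRange(json), json))
                .count("age-range-counts");

        // Emit the running counts to an output topic (topic name is an assumption).
        countKTablePerRange.to(Serdes.String(), Serdes.Long(), "age-range-counts-topic");

        new KafkaStreams(builder, props).start();
    }

    // Hypothetical helper: pull the "age_group" field out of the JSON value.
    // A real application would use a proper JSON library instead.
    private static String extractAgeRange(String json) {
        int field = json.indexOf("\"age_group\"");
        int start = json.indexOf('"', json.indexOf(':', field) + 1) + 1;
        int end = json.indexOf('"', start);
        return json.substring(start, end);
    }
}

With this topology, updating Alice's record on compacted-topic-1 should eventually retract her old age-range count and increment the new one.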

Update

About the 4 values in the re-partitioning topic: that's correct. Each update to the base KTable writes a record for its "old value" and its "new value". This is required to update the downstream KTable correctly: the old value must be removed from one count and the new value must be added to another count. Because your (count) KTable is potentially distributed (i.e., sharded over multiple parallel running app instances), both records (old and new) might end up at different instances, because they might have different keys, and thus they must be sent as two independent records. (The record format should be more complex than what you show in your question, though.)

This also explains why you need a subtractor and an adder: the subtractor removes the old record from the aggregation result, while the adder adds the new record to the aggregation result.
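
To make the adder/subtractor roles concrete, here is a hedged sketch of the same count expressed as an explicit aggregate(), reusing the hypothetical extractAgeRange() helper from the sketch above. When Alice's record changes, the subtractor fires for her old "18-24" key and the adder for her new "25-30" key:

// Sketch only: count() already behaves roughly like this; shown to illustrate
// which callback runs for the retracted old value vs. the new value.
KTable<String, Long> counts = dataKTable
        .groupBy((userId, json) -> KeyValue.pair(extractAgeRange(json), json))
        .aggregate(
                () -> 0L,                              // initializer: start each age range at zero
                (ageRange, json, total) -> total + 1,  // adder: called with the new value
                (ageRange, json, total) -> total - 1,  // subtractor: called with the retracted old value
                Serdes.Long(),
                "age-range-counts-agg");               // store name is an assumption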

Still not sure why you don't see the correct count in the result. How many instances do you run? Maybe try to disable the KTable cache by setting cache.max.bytes.buffering=0 in StreamsConfig.
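
For example, a minimal sketch of that setting (the rest of the configuration is omitted):

Properties props = new Properties();
// ... application id, bootstrap servers, serdes, etc. ...
// Disable record caching so every single update is forwarded downstream
// immediately instead of being deduplicated in the cache first.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);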
