Kafka Streams - is it possible to reduce the number of internal topics created by multiple aggregations
I have a Kafka Streams app that groups incoming messages by several values. For example:
Example message:
{ "gender": "female", "location": "canada", "age-group": "25-30" }
Topology:
table
.groupBy((key, value) -> groupByGender) // example key: female
.count("gender-counts");
table
.groupBy((key, value) -> groupByLocation) // example key: canada
.count("location-counts");
table
.groupBy((key, value) -> groupByAgeGroup) // example key: 25-30
.count("age-group-counts");
This results in lots of topics:
my-consumer-gender-counts-changelog
my-consumer-gender-counts-repartition
my-consumer-location-counts-changelog
my-consumer-location-counts-repartition
my-consumer-age-group-counts-changelog
my-consumer-age-group-counts-repartition
It would be nice if we could send multiple aggregations to a single state store and include the group-by value as part of the key. For example:
table
.groupBy((key, value) -> groupByGender) // example key: female_gender
.count("counts");
table
.groupBy((key, value) -> groupByLocation) // example key: canada_location
.count("counts");
table
.groupBy((key, value) -> groupByAgeGroup) // example key: 25-30_age_group
.count("counts");
This would result in far fewer topics:
counts-changelog
counts-repartition
This currently doesn't appear to be possible (using the DSL, anyway), since the groupBy operator creates an internal topic for repartitioning. So if we have multiple sub-topologies that groupBy different things, Kafka Streams will attempt to register the same repartitioning topic from multiple sources. This results in the following error:
org.apache.kafka.streams.errors.TopologyBuilderException: Invalid topology building: Topic counts-repartition has already been registered by another source.
at org.apache.kafka.streams.processor.TopologyBuilder.validateTopicNotAlreadyRegistered(TopologyBuilder.java:518)
If groupBy could return more than one record (e.g. like flatMap does), then we could return a collection of records (one record for each grouping), but this too doesn't seem to be possible using the DSL.
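The flatMap-style grouping described above can be sketched outside the DSL in plain Java. This is only a conceptual illustration, not Kafka Streams API: the `explode` helper and the `attribute_value` composite-key format are assumptions made up for the example. Each message emits one composite key per grouping attribute, and all counts land in a single map, which is the role the single `counts` store would play.

```java
import java.util.*;

public class MultiGroupCount {
    // One input message with several groupable attributes.
    record Message(String gender, String location, String ageGroup) {}

    // Hypothetical helper: emit one composite key per grouping,
    // e.g. "gender_female", mirroring what a flatMap-style groupBy
    // could produce if the DSL supported it.
    static List<String> explode(Message m) {
        return List.of("gender_" + m.gender(),
                       "location_" + m.location(),
                       "age-group_" + m.ageGroup());
    }

    public static void main(String[] args) {
        List<Message> input = List.of(
                new Message("female", "canada", "25-30"),
                new Message("male", "canada", "30-35"));

        // All aggregations land in one logical "counts" store.
        Map<String, Long> counts = new TreeMap<>();
        for (Message m : input)
            for (String key : explode(m))
                counts.merge(key, 1L, Long::sum);

        // e.g. location_canada -> 2, gender_female -> 1
        counts.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Note that this sketch only models insert-style KStream semantics; a KTable's update and delete semantics (subtracting the old value before adding the new one) are not captured here.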
My question is, given a single record that can be grouped by multiple values (e.g. { "gender": "female", "location": "canada", "age-group": "25-30" }), should the creation of multiple topics (2 for each grouping) ever be of concern (e.g. what if we had 100 different groupings)? Are there other strategies that might be a better fit when a single record could be grouped by several values? Is what I'm proposing (sinking multiple aggregations into a single changelog topic) a bad idea (even when the number of unique keys is very low)?
If you want to group by different attributes, you cannot avoid multiple repartitioning topics. Assume you have two grouping attributes g1 and g2, and three records with the following values:
r1 = g1:A, g2:1
r2 = g1:A, g2:2
r3 = g1:B, g2:2
Thus, to correctly aggregate the records based on g1, records r1 and r2 must be grouped together. Assume your repartitioning topic has 2 partitions, p1 and p2; the records would get redistributed like:
p1: r1, r2
p2: r3
On the other hand, if you aggregate on g2, records r2 and r3 must be grouped together:
p1: r1
p2: r2, r3
Note that r2 must go to different partitions in the two cases, and thus it's not possible to use a single topic; you need one topic per grouping. (This is not Kafka specific: any other framework would need to replicate and redistribute the data multiple times, too.)
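The placement conflict for r2 can be checked with a small stand-alone simulation. A plain hash-mod scheme stands in for Kafka's real default partitioner (which actually uses murmur2 over the serialized key bytes), but the argument is the same for any deterministic key-to-partition mapping:

```java
import java.util.*;

public class RepartitionDemo {
    // Stand-in for Kafka's default partitioner; the real one uses
    // murmur2 over serialized key bytes, this is just hash-mod.
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode()) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 2;
        // record -> {g1 value, g2 value}
        Map<String, String[]> records = new LinkedHashMap<>();
        records.put("r1", new String[]{"A", "1"});
        records.put("r2", new String[]{"A", "2"});
        records.put("r3", new String[]{"B", "2"});

        for (Map.Entry<String, String[]> e : records.entrySet()) {
            int byG1 = partitionFor(e.getValue()[0], numPartitions);
            int byG2 = partitionFor(e.getValue()[1], numPartitions);
            System.out.printf("%s: partition by g1 = %d, partition by g2 = %d%n",
                    e.getKey(), byG1, byG2);
        }
        // r2 must sit with r1 when keyed by g1 ("A") but with r3 when
        // keyed by g2 ("2"); the two required placements disagree, so
        // one repartition topic cannot serve both groupings.
    }
}
```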
Theoretically, it would be possible to reduce the number of topics if you added more semantic information (like a super-key, sub-key, or 1-to-1 key mapping). But that's not supported by Kafka Streams (and, AFAIK, by any other comparable system).