Kafka Streams - is it possible to reduce the number of internal topics created by multiple aggregations
I have a Kafka Streams app that groups incoming messages by several values. For example:
Example message:
{ "gender": "female", "location": "canada", "age-group": "25-30" }
Topology:
table
.groupBy((key, value) -> groupByGender) // example key: female
.count("gender-counts");
table
.groupBy((key, value) -> groupByLocation) // example key: canada
.count("location-counts");
table
.groupBy((key, value) -> groupByAgeGroup) // example key: 25-30
.count("age-group-counts");
This results in lots of topics:
my-consumer-gender-counts-changelog
my-consumer-gender-counts-repartition
my-consumer-location-counts-changelog
my-consumer-location-counts-repartition
my-consumer-age-group-counts-changelog
my-consumer-age-group-counts-repartition
It would be nice if we could send multiple aggregations to a single state store and include the group-by value as part of the key. For example:
table
.groupBy((key, value) -> groupByGender) // example key: female_gender
.count("counts");
table
.groupBy((key, value) -> groupByLocation) // example key: canada_location
.count("counts");
table
.groupBy((key, value) -> groupByAgeGroup) // example key: 25-30_age_group
.count("counts");
This would result in far fewer topics:
counts-changelog
counts-repartition
This currently doesn't appear to be possible (using the DSL, anyway), since the groupBy operator creates an internal topic for repartitioning. So if we have multiple sub-topologies that groupBy different things, Kafka Streams will attempt to register the same repartitioning topic from multiple sources. This results in the following error:
org.apache.kafka.streams.errors.TopologyBuilderException: Invalid topology building: Topic counts-repartition has already been registered by another source.
at org.apache.kafka.streams.processor.TopologyBuilder.validateTopicNotAlreadyRegistered(TopologyBuilder.java:518)
If groupBy could return more than one record (e.g. like flatMap does), then we could return a collection of records (one record for each grouping), but this too doesn't seem to be possible using the DSL.
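The flatMap-style grouping described above can be sketched outside the DSL in plain Java. This is only a conceptual illustration, not Kafka Streams API: the `explode` helper and the `attribute_value` composite-key format are assumptions made up for the example. Each message emits one composite key per grouping attribute, and all counts land in a single map, which is the role the single `counts` store would play.

```java
import java.util.*;

public class MultiGroupCount {
    // One input message with several groupable attributes.
    record Message(String gender, String location, String ageGroup) {}

    // Hypothetical helper: emit one composite key per grouping,
    // e.g. "gender_female", mirroring what a flatMap-style groupBy
    // could produce if the DSL supported it.
    static List<String> explode(Message m) {
        return List.of("gender_" + m.gender(),
                       "location_" + m.location(),
                       "age-group_" + m.ageGroup());
    }

    public static void main(String[] args) {
        List<Message> input = List.of(
                new Message("female", "canada", "25-30"),
                new Message("male", "canada", "30-35"));

        // All aggregations land in one logical "counts" store.
        Map<String, Long> counts = new TreeMap<>();
        for (Message m : input)
            for (String key : explode(m))
                counts.merge(key, 1L, Long::sum);

        // e.g. location_canada -> 2, gender_female -> 1
        counts.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Note that this sketch only models insert-style KStream semantics; a KTable's update and delete semantics (subtracting the old value before adding the new one) are not captured here.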
My question is, given a single record that can be grouped by multiple values (e.g. { "gender": "female", "location": "canada", "age-group": "25-30" }), should the creation of multiple topics (2 for each grouping) ever be of concern (e.g. what if we had 100 different groupings)? Are there other strategies that might be a better fit when a single record could be grouped by several values? Is what I'm proposing (sinking multiple aggregations into a single changelog topic) a bad idea (even when the number of unique keys is very low)?
If you want to group by different attributes, you cannot avoid multiple repartitioning topics. Assume you have two grouping attributes g1 and g2, and three records with the following values:
r1 = g1:A, g2:1
r2 = g1:A, g2:2
r3 = g1:B, g2:2
Thus, to correctly aggregate the records based on g1, records r1 and r2 must be grouped together. Assume your repartitioning topic has 2 partitions, p1 and p2; the records would get redistributed like:
p1: r1, r2
p2: r3
On the other hand, if you aggregate on g2, records r2 and r3 must be grouped together:
p1: r1
p2: r2, r3
Note that r2 must go to different partitions in the two cases, and thus it's not possible to use a single topic; you need one topic per grouping. (This is not Kafka specific: any other framework would need to replicate and redistribute the data multiple times, too.)
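The placement conflict for r2 can be checked with a small stand-alone simulation. A plain hash-mod scheme stands in for Kafka's real default partitioner (which actually uses murmur2 over the serialized key bytes), but the argument is the same for any deterministic key-to-partition mapping:

```java
import java.util.*;

public class RepartitionDemo {
    // Stand-in for Kafka's default partitioner; the real one uses
    // murmur2 over serialized key bytes, this is just hash-mod.
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode()) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 2;
        // record -> {g1 value, g2 value}
        Map<String, String[]> records = new LinkedHashMap<>();
        records.put("r1", new String[]{"A", "1"});
        records.put("r2", new String[]{"A", "2"});
        records.put("r3", new String[]{"B", "2"});

        for (Map.Entry<String, String[]> e : records.entrySet()) {
            int byG1 = partitionFor(e.getValue()[0], numPartitions);
            int byG2 = partitionFor(e.getValue()[1], numPartitions);
            System.out.printf("%s: partition by g1 = %d, partition by g2 = %d%n",
                    e.getKey(), byG1, byG2);
        }
        // r2 must sit with r1 when keyed by g1 ("A") but with r3 when
        // keyed by g2 ("2"); the two required placements disagree, so
        // one repartition topic cannot serve both groupings.
    }
}
```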
Theoretically, it would be possible to reduce the number of topics if you added more semantic information (like a super-key, sub-key, or 1-to-1 key mapping). But that's not supported by Kafka Streams (and, AFAIK, by any other comparable system).