
Kafka Streams - is it possible to reduce the number of internal topics created by multiple aggregations

I have a Kafka Streams app that groups incoming messages by several values. For example:

Example message:

{ "gender": "female", "location": "canada", "age-group": "25-30" }

Topology:

table
    .groupBy((key, value) -> groupByGender) // example key: female
    .count("gender-counts");

table
    .groupBy((key, value) -> groupByLocation) // example key: canada
    .count("location-counts");

table
    .groupBy((key, value) -> groupByAgeGroup) // example key: 25-30
    .count("age-group-counts");

This results in lots of topics:

my-consumer-gender-counts-changelog
my-consumer-gender-counts-repartition
my-consumer-location-counts-changelog
my-consumer-location-counts-repartition
my-consumer-age-group-counts-changelog
my-consumer-age-group-counts-repartition

It would be nice if we could send multiple aggregations to a single state store and include the group-by value as part of the key. For example:

table
    .groupBy((key, value) -> groupByGender) // example key: female_gender
    .count("counts");

table
    .groupBy((key, value) -> groupByLocation) // example key: canada_location
    .count("counts");

table
    .groupBy((key, value) -> groupByAgeGroup) // example key: 25-30_age_group
    .count("counts");

This would result in far fewer topics:

counts-changelog
counts-repartition

This currently doesn't appear to be possible (at least using the DSL), since the groupBy operator creates an internal topic for repartitioning; if we have multiple sub-topologies that groupBy different things, Kafka Streams will attempt to register the same repartitioning topic from multiple sources. This results in the following error:

org.apache.kafka.streams.errors.TopologyBuilderException: Invalid topology building: Topic counts-repartition has already been registered by another source.
        at org.apache.kafka.streams.processor.TopologyBuilder.validateTopicNotAlreadyRegistered(TopologyBuilder.java:518)

If groupBy could return more than one record (e.g. like flatMap does), then we could return a collection of records (one record for each grouping), but this also doesn't seem to be possible using the DSL.
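
That said, if the input can be consumed as an append-only KStream rather than a KTable, the fan-out idea can be sketched in the DSL with flatMap followed by a single groupByKey. This is a minimal sketch, not code from the question: the "profiles" topic name, the Map value type, and a configured JSON-to-Map serde are all assumptions, and it uses the same 0.11-era API as the snippets above:

import java.util.Arrays;
import java.util.Map;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class FanOutCounts {
    public static void build(KStreamBuilder builder) {
        // Assumption: values arrive as a Map via the default value serde.
        KStream<String, Map<String, String>> stream = builder.stream("profiles");
        stream
            // Fan out each record into one record per grouping, encoding the
            // grouping dimension in the new key (e.g. "gender_female").
            .flatMap((key, attrs) -> Arrays.asList(
                KeyValue.pair("gender_" + attrs.get("gender"), ""),
                KeyValue.pair("location_" + attrs.get("location"), ""),
                KeyValue.pair("age-group_" + attrs.get("age-group"), "")))
            .groupByKey()
            // One repartition topic and one changelog topic back this store.
            .count("counts");
    }
}

Note that the semantics differ from the KTable version above: a stream count only ever increments, while the table-based counts handle updates to existing rows via retractions, so this only fits append-only input.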

My question is, given a single record that can be grouped by multiple values (e.g. { "gender": "female", "location": "canada", "age-group": "25-30" }), should the creation of multiple topics (two for each grouping) ever be a concern (e.g. what if we had 100 different groupings)? Are there other strategies that might be a better fit when a single record can be grouped by several values? Is what I'm proposing (sinking multiple aggregations to a single changelog topic) a bad idea (even when the number of unique keys is very low)?

If you want to group by different attributes, you cannot avoid multiple repartitioning topics. Assume you have two grouping attributes g1 and g2 and three records with the following values:

r1 = g1:A, g2:1
r2 = g1:A, g2:2
r3 = g1:B, g2:2

Thus, to correctly aggregate the records based on g1, records r1 and r2 must be grouped together. Assume your repartitioning topic has 2 partitions, p1 and p2; the records would get redistributed like:

p1: r1, r2
p2: r3

On the other hand, if you aggregate on g2, records r2 and r3 must be grouped together:

p1: r1
p2: r2, r3

Note that r2 must go to a different partition in each case; thus, it's not possible to use a single topic, and you need one topic per grouping. (This is not Kafka-specific -- any other framework would need to replicate and redistribute the data multiple times, too.)
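
To make the co-location argument concrete: the partition a record is routed to is a pure function of its key, so r2's placement is fixed by whichever grouping key it carries. Below is a minimal sketch that reproduces the key-to-partition formula of Kafka's DefaultPartitioner using the public murmur2 helpers in kafka-clients; the concrete partition numbers depend on the hash, but the point is that r2's key (and hence its partition) differs between the two groupings:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class PartitionDemo {
    // Same murmur2-based formula the DefaultPartitioner applies to keyed records.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        // Grouping by g1: r1 and r2 both carry key "A" and always co-locate.
        System.out.println("g1 key A -> partition " + partitionFor("A", 2));
        System.out.println("g1 key B -> partition " + partitionFor("B", 2));
        // Grouping by g2: r2 and r3 both carry key "2" and always co-locate.
        System.out.println("g2 key 1 -> partition " + partitionFor("1", 2));
        System.out.println("g2 key 2 -> partition " + partitionFor("2", 2));
    }
}

Since the keys "A" and "2" hash independently, no single partitioning of one topic can satisfy both of r2's co-location requirements at once.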

Theoretically, it's possible to reduce the number of topics if you add more semantic information (like a super-key, sub-key, or 1-to-1 key mapping). But that's not supported by Kafka Streams (and AFAIK, by no other comparable system).
