
Kafka Streams: Stream Thread vs Partition of multiple topics

Suppose I have 2 topics, say xyz1 and xyz2, each having 3 partitions. If I have a single Kafka Streams application with 3 threads, can the following scenario occur?

Thread                  Partition
    1       xyz1-partition 0, xyz2-partition 2
    2       xyz1-partition 1, xyz2-partition 0
    3       xyz1-partition 2, xyz2-partition 1

as opposed to:

Thread                  Partition
    1       xyz1-partition 0, xyz2-partition 0
    2       xyz1-partition 1, xyz2-partition 1
    3       xyz1-partition 2, xyz2-partition 2

Essentially, can a single thread consume data from one partition of each of 2 different topics, where the partition numbers differ? Assume we use the low-level Processor API.

Whether the scenario can occur depends on your topology.

Actually, stream tasks are assigned to stream threads, not plain partitions. Each task processes a group of partitions, and a group contains one or more partitions. If the group contains multiple partitions, it always contains the same partitions (i.e., the ones with the same partition number) of different topics. For example, a group may contain xyz1-partition 0 and xyz2-partition 0, but not xyz1-partition 0 and xyz2-partition 2. This assumes that the different topics use the same partitioning strategy. Such co-partitioning of the same partitions of different topics is needed, for example, for a join, where records with the same key must be processed by the same stream task, as in your second scenario.
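
The grouping rule described above can be modelled in a few lines. This is only a sketch of the idea in plain Python, not Kafka Streams code: the function name `group_by_partition_number` is made up for illustration, and it assumes every topic feeds the same sub-topology and has the same number of partitions.

```python
from collections import defaultdict

def group_by_partition_number(topic_partition_counts):
    """Model of grouping co-partitioned topics by partition number.

    topic_partition_counts maps topic name -> number of partitions.
    Returns a dict: partition number -> list of (topic, partition) pairs.
    Hypothetical helper for illustration; not part of any Kafka API.
    """
    groups = defaultdict(list)
    for topic in sorted(topic_partition_counts):
        for partition in range(topic_partition_counts[topic]):
            groups[partition].append((topic, partition))
    return dict(groups)

groups = group_by_partition_number({"xyz1": 3, "xyz2": 3})
for number, members in sorted(groups.items()):
    print(number, members)
```

Each resulting group corresponds to one task, so xyz1-partition 0 only ever shares a group with xyz2-partition 0.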

If you assume that in your first example each partition is processed by a different stream task, i.e., each partition group contains one partition, the scenario may occur.

If you assume that both partitions on each line are processed by the same stream task (i.e., both partitions are part of the same partition group), the scenario cannot occur, because partition groups cannot contain different partitions.
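
To make the two cases concrete, here is a small Python check, written as a sketch under simplifying assumptions rather than as the real assignor logic. The helpers `tasks_for` and `layout_possible` are hypothetical names: a thread layout is treated as possible only if every task's partitions sit together on one thread.

```python
def tasks_for(topics, num_partitions, co_grouped):
    """Build the tasks as sets of (topic, partition) pairs.

    co_grouped=True: one task per partition number, spanning all topics.
    co_grouped=False: one task per single partition.
    """
    if co_grouped:
        return [frozenset((t, p) for t in topics) for p in range(num_partitions)]
    return [frozenset({(t, p)}) for t in topics for p in range(num_partitions)]

def layout_possible(layout, tasks):
    """True if every task lies entirely on one thread of the layout."""
    return all(any(task <= owned for owned in layout.values()) for task in tasks)

topics = ["xyz1", "xyz2"]
scenario_1 = {
    1: {("xyz1", 0), ("xyz2", 2)},
    2: {("xyz1", 1), ("xyz2", 0)},
    3: {("xyz1", 2), ("xyz2", 1)},
}
scenario_2 = {n + 1: {("xyz1", n), ("xyz2", n)} for n in range(3)}

grouped = tasks_for(topics, 3, co_grouped=True)
single = tasks_for(topics, 3, co_grouped=False)

print("scenario 1, co-grouped tasks:", layout_possible(scenario_1, grouped))
print("scenario 2, co-grouped tasks:", layout_possible(scenario_2, grouped))
print("scenario 1, one partition per task:", layout_possible(scenario_1, single))
```

The check only says whether a layout is consistent with the grouping; which consistent layout is actually chosen is up to the assignor.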

For more information on the assignment strategy, see https://github.com/apache/kafka/blob/e4262471c9aee4a4c04dd04ebbdbdba7e3c5ead1/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamsPartitionAssignor.java#L297

That said, there is actually a way to assign different partitions to the same task: implement the PartitionGrouper interface. However, this interface will be deprecated in 2.4 and removed in 3.0. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-528%3A+Deprecate+PartitionGrouper+configuration+and+interface

It depends.

Plain Kafka Consumer:

A Kafka consumer group consists of a pool of consumers/instances/processes with the same group.id, which can run either on the same machine or on distributed machines. The Kafka consumer uses rebalancing to assign partitions to each consumer without overlap, meaning that one partition is assigned to at most one consumer process in the consumer group.

It is also possible for the consumer to manually assign specific partitions (similar to the older "simple" consumer) using assign(Collection). In this case, dynamic partition assignment and consumer group coordination will be disabled.

So with a plain consumer, a partition can be assigned to any thread during rebalancing.
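
As a rough illustration of why partition numbers need not line up for plain consumers, the sketch below models a round-robin style assignment in Python. It is a simplified model, not the Kafka client code: `round_robin_assign` and the consumer names are hypothetical, and it assumes every consumer subscribes to both topics.

```python
def round_robin_assign(consumers, topic_partition_counts):
    """Deal all partitions, sorted by topic then number, across sorted consumers.

    Simplified model of a round-robin strategy; hypothetical helper.
    """
    members = sorted(consumers)
    partitions = [
        (topic, p)
        for topic in sorted(topic_partition_counts)
        for p in range(topic_partition_counts[topic])
    ]
    assignment = {member: [] for member in members}
    for index, tp in enumerate(partitions):
        assignment[members[index % len(members)]].append(tp)
    return assignment

assignment = round_robin_assign(["c0", "c1"], {"xyz1": 3, "xyz2": 3})
for member, owned in assignment.items():
    print(member, owned)
```

With two consumers in this model, one consumer ends up with xyz1-partition 0 and xyz2-partition 1, so nothing ties a consumer to a single partition number across topics.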


Kafka Streams:

Kafka Streams uses stream tasks as the logical unit for assigning partitions and parallelizing processing. It creates a number of stream tasks based on the input stream partitions and assigns a list of partitions to each task. Once partitions are assigned to a task, the assignment sticks, and each task processes its partitions with its own copy of the topology. As a result, stream tasks can be processed independently and in parallel without manual intervention.

The default implementation of the PartitionGrouper interface groups partitions by partition id. Join operations require that the topics of the joining entities are co-partitioned, i.e., partitioned by the same key and having the same number of partitions. Co-partitioning is ensured by having the same number of partitions on the joined topics, and by using the same serialization and the producer's default partitioner.
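
The co-partitioning requirement can be seen with a toy partitioner. The sketch below is an approximation in Python: `partition_for` is a hypothetical helper, and `zlib.crc32` stands in for the murmur2 hash that the producer's default partitioner applies to the serialized key, so the concrete partition numbers will differ from real Kafka.

```python
import zlib

def partition_for(key_bytes, num_partitions):
    """Map serialized key bytes to a partition: hash modulo partition count.

    Assumption: zlib.crc32 is only a stand-in for the real murmur2 hash.
    """
    return zlib.crc32(key_bytes) % num_partitions

keys = [f"user-{i}" for i in range(100)]

# Same serialization and same partition count on both topics:
# every key maps to the same partition number in xyz1 and xyz2.
aligned = all(
    partition_for(k.encode("utf-8"), 3) == partition_for(k.encode("utf-8"), 3)
    for k in keys
)

# Different partition counts (3 vs 4) break the alignment for some keys.
mismatched = [
    k for k in keys
    if partition_for(k.encode("utf-8"), 3) != partition_for(k.encode("utf-8"), 4)
]

print("aligned with equal counts:", aligned)
print("keys that move with unequal counts:", len(mismatched))
```

Because the partition depends only on the key bytes and the partition count, matching both across topics is what places records with the same key in the same partition number, which is what a join relies on.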

So in your case, scenario 1 is not possible, whereas scenario 2 is possible.
