
Kafka consuming from multiple topics with respect to timestamp

I need to create a consumer that can pull from multiple topics and order the messages by timestamp (the Kafka message timestamp).

Kind of like this: [image] (Sorry for the bad drawing...)

In this example I subscribe to "Topic A" and "Topic B" and queue the messages in the order of their timestamps.

Now, as long as all topics have only one partition, this is easily solvable with this pseudocode:

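kafka.subscribe(['topicA', 'topicB'])
# One FIFO queue per subscribed topic
messagesByTopic = {'topicA': Queue(), 'topicB': Queue()}
finalMessageQueue = Queue()
while true:
    records = kafka.poll()
    for record in records:
        messagesByTopic[record.topic()].enqueue(record)

    # Merge only while every topic has a buffered record; an empty
    # queue might still receive a message older than the current heads.
    while messagesByTopic.all(queue => queue.notEmpty()):
        minQueue = messagesByTopic.min(queue => queue.peek().timestamp)
        finalMessageQueue.enqueue(minQueue.pop())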
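
The merge step in the pseudocode above can be tried without a broker. The following is only a minimal sketch: `Record` and `TimestampMerger` are hypothetical names, not part of any Kafka client, and it assumes each topic has a single partition whose records arrive in timestamp order.

```python
from collections import deque
from typing import NamedTuple


class Record(NamedTuple):
    """Hypothetical stand-in for a consumed Kafka record."""
    topic: str
    timestamp: int
    value: str


class TimestampMerger:
    """Buffers records per topic and releases them in timestamp order."""

    def __init__(self, topics):
        self.queues = {topic: deque() for topic in topics}

    def add(self, records):
        # Assumption: records of one topic arrive already ordered by timestamp.
        for record in records:
            self.queues[record.topic].append(record)

    def drain(self):
        # Release records only while every topic has something buffered;
        # an empty topic could still deliver a record older than the heads.
        out = []
        while self.queues and all(self.queues.values()):
            oldest = min(self.queues.values(), key=lambda q: q[0].timestamp)
            out.append(oldest.popleft())
        return out


merger = TimestampMerger(["topicA", "topicB"])
merger.add([Record("topicA", 1, "a1"), Record("topicA", 4, "a2")])
print(merger.drain())  # [] because topicB has nothing buffered yet
merger.add([Record("topicB", 2, "b1"), Record("topicB", 3, "b2")])
print([r.value for r in merger.drain()])  # ['a1', 'b1', 'b2']
```

Note that "a2" stays buffered until topicB delivers another record, so a quiet topic stalls the output. A real consumer would probably need some timeout policy for that case.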

The problem arises when I introduce multiple partitions to each topic. Obviously, it is impossible to sort multiple topics into a single stream ordered by time, because order is guaranteed only inside a partition, not inside a topic. So the new problem is to sort multiple topics into streams that share the same key.


Imagine 2 topics, order and withdrawal. The key of the messages inside the topics is the id of the customer to whom the transaction belongs.

The objective is to stream all topics into queues (one for each customer), sorted by timestamp.

It should be possible theoretically, since the messages in the order and withdrawal topics are ordered by timestamp per customer. In fact, when dealing with a single partition per topic, this problem is easily solvable.
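
For the single-partition case, the per-customer queues could be built roughly like this. It is a batch-style sketch over finite in-memory lists, not a streaming consumer. The `Txn` type and the sample data are made up for illustration, and the only ordering assumed is the one stated above: within each topic, a customer's messages are in timestamp order.

```python
import heapq
from collections import defaultdict
from typing import NamedTuple


class Txn(NamedTuple):
    """Hypothetical message: timestamp, key (customer id) and source topic."""
    timestamp: int
    customer_id: str
    topic: str


def per_customer_queues(streams):
    """Return {customer_id: [messages from all streams, ordered by timestamp]}."""
    # by_customer[customer][i] holds that customer's messages from stream i,
    # in arrival order (assumed to be timestamp order per customer).
    by_customer = defaultdict(lambda: [[] for _ in streams])
    for i, stream in enumerate(streams):
        for msg in stream:
            by_customer[msg.customer_id][i].append(msg)
    # Merge each customer's already-sorted per-topic lists into one queue.
    return {
        customer: list(heapq.merge(*lists, key=lambda m: m.timestamp))
        for customer, lists in by_customer.items()
    }


orders = [Txn(1, "alice", "order"), Txn(5, "bob", "order"), Txn(7, "alice", "order")]
withdrawals = [Txn(2, "bob", "withdrawal"), Txn(6, "alice", "withdrawal")]

queues = per_customer_queues([orders, withdrawals])
print([(m.timestamp, m.topic) for m in queues["alice"]])
# [(1, 'order'), (6, 'withdrawal'), (7, 'order')]
```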

Now, consider the case of 2 orders partitions and 1 withdrawal partition. What happens if I have two processes running simultaneously? One process will have the withdrawals of all customers but the orders of only half of the customers, and the second process will have only the orders of the other half. It breaks apart.

The only way is to somehow tell Kafka to make sure that the same keys (even from different topics) are always routed to the same process, but as far as I know, there is no way to do that.

I am stuck. I need an idea of how to approach it.

To achieve the desired effect, you should ensure a certain correspondence between the partitions of both topics, either by changing how the producers partition messages or by re-partitioning the data from the original topics into new intermediate topics before your ordering logic. Ideally, you would have a 1-to-1 correspondence between the partitions of both topics.

In general, your parallelism (number of threads) is bounded by the highest common factor of the partition counts of both topics. E.g. if the orders topic has 12 partitions and the withdrawal topic has 9, then you can assign the partitions to HCF(12,9) = 3 threads as follows:

Thread 1: orders partitions (0,1,2,3), withdrawal partitions (0,1,2)
Thread 2: orders partitions (4,5,6,7), withdrawal partitions (3,4,5)
Thread 3: orders partitions (8,9,10,11), withdrawal partitions (6,7,8)

For this to work you would need to implement custom partitioning for both topics instead of the default one.
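
One way such a partitioning scheme might look is sketched below. All function names are hypothetical, and `zlib.crc32` is only a stand-in for a stable hash (it is not the hash Kafka's default partitioner uses). The idea is that the key first selects a thread, and then a partition inside that thread's contiguous block, so the same customer id lands on the same thread in both topics.

```python
import zlib
from math import gcd


def thread_count(*partition_counts):
    """Upper bound on parallelism: HCF of the topics' partition counts."""
    return gcd(*partition_counts)


def partitions_for_thread(thread, num_partitions, num_threads):
    """Contiguous block of one topic's partitions owned by a thread (0-based)."""
    size = num_partitions // num_threads
    return list(range(thread * size, (thread + 1) * size))


def partition_for_key(key, num_partitions, num_threads):
    """Hypothetical custom partitioner: pick the thread from the key,
    then a partition inside that thread's block."""
    h = zlib.crc32(key.encode("utf-8"))  # stand-in for a stable hash
    size = num_partitions // num_threads
    thread = h % num_threads
    offset = (h // num_threads) % size
    return thread * size + offset


def owner_of(partition, num_partitions, num_threads):
    """Thread that consumes the given partition under the block assignment."""
    return partition // (num_partitions // num_threads)


threads = thread_count(12, 9)
print(threads)                                  # 3
print(partitions_for_thread(0, 12, threads))    # [0, 1, 2, 3]
print(partitions_for_thread(0, 9, threads))     # [0, 1, 2]

# The same customer id maps to the same thread in both topics.
for customer in ("customer-1", "customer-2", "customer-3"):
    orders_p = partition_for_key(customer, 12, threads)
    withdrawal_p = partition_for_key(customer, 9, threads)
    assert owner_of(orders_p, 12, threads) == owner_of(withdrawal_p, 9, threads)
```

Here threads are numbered from 0, so thread 0 corresponds to "Thread 1" in the listing above. On the consuming side, each thread would then need to be assigned exactly its blocks of partitions rather than relying on automatic group assignment.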

However, if one topic has 1 partition and the other has 2, then HCF(1,2) is 1, meaning you can only do this in a single-threaded way.
