
Kafka consuming from multiple topics with respect to timestamp

I need to create a consumer that can pull from multiple topics and order the messages by timestamp (the Kafka message timestamp).

Something like this: [image] (Sorry for the bad drawing...)

In this example I subscribe to "Topic A" and "Topic B" and queue the messages in order of their timestamps.

Now, as long as all topics have only one partition, this is easily solvable with this pseudocode:

kafka.subscribe(['topicA', 'topicB'])
messagesByTopic = {'topicA': Queue(), 'topicB': Queue()}
finalMessageQueue = Queue()
while true:
    records = kafka.poll()
    for record in records:
        messagesByTopic[record.topic()].enqueue(record)

    # Merge only while every topic queue has a head record to compare;
    # if one queue is empty, an earlier record may still arrive for it.
    while messagesByTopic.all(queue => queue.notEmpty()):
        minQueue = messagesByTopic.min(queue => queue.peek().timestamp)
        finalMessageQueue.enqueue(minQueue.pop())
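
For reference, here is a minimal runnable sketch of the same merge in Python. It does not talk to Kafka: `Record`, `merge_ready`, and the two hard-coded batches are hypothetical stand-ins for real records and `poll()` results, and each topic is assumed to have a single partition whose records arrive in timestamp order.

```python
from collections import deque, namedtuple

# Hypothetical stand-in for a Kafka record; only the fields the merge needs.
Record = namedtuple("Record", ["topic", "timestamp", "value"])

def merge_ready(queues, out):
    """Move records to `out` in timestamp order while every queue has a head."""
    while all(queues.values()):  # an empty deque is falsy
        # The queue whose head record has the smallest timestamp goes next.
        min_queue = min(queues.values(), key=lambda q: q[0].timestamp)
        out.append(min_queue.popleft())

queues = {"topicA": deque(), "topicB": deque()}
final_queue = []

# Two simulated poll() batches; within each topic, timestamps only increase.
batches = [
    [Record("topicA", 1, "a1"), Record("topicA", 4, "a2"), Record("topicB", 2, "b1")],
    [Record("topicB", 3, "b2"), Record("topicB", 6, "b3"), Record("topicA", 5, "a3")],
]
for batch in batches:
    for record in batch:
        queues[record.topic].append(record)
    merge_ready(queues, final_queue)

merged_values = [r.value for r in final_queue]
```

Note that the last `topicB` record stays buffered: it cannot be emitted until `topicA` delivers another record (or some timeout policy decides that none is coming), because an earlier `topicA` record could still arrive.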

The problem arises when I introduce multiple partitions per topic. It is then impossible to sort multiple topics into a single stream ordered by time, because order is guaranteed only within a partition, not across a whole topic. So the new problem is to sort multiple topics into streams, one per key, each ordered by time.

[image]

Imagine 2 topics, order and withdrawal. The key of the messages in both topics is the ID of the customer to whom the transaction belongs.

The objective is to stream all topics into queues (one for each customer), sorted by timestamp.

It should be possible theoretically since the messages in the order and withdrawal topics are ordered by timestamp per customer, and in fact, when dealing with a single partition per topic, this problem is easily solvable.

Now consider the case of 2 order partitions and 1 withdrawal partition. What happens if I have two processes running simultaneously? One process will receive the withdrawals of all customers but the orders of only half of them, and the second process will receive only the orders of the other half. It breaks apart.

The only way is to somehow tell Kafka to make sure that same keys (even from different topics) will always be routed to the same process, but as far as I know, there is no way to do it.

I am stuck. I need an idea of how to approach it.

To achieve the desired effect, you need to ensure a correspondence between the partitions of both topics, either by changing how the producers partition messages or by re-partitioning the data from the original topics into new intermediate topics before your ordering logic runs. Ideally, you would have a 1-to-1 correspondence between the partitions of both topics. In general, your parallelism (number of threads) is bounded by the highest common factor (HCF) of the partition counts of both topics. For example, if the orders topic has 12 partitions and the withdrawal topic has 9, you can assign the partitions to HCF(12, 9) = 3 threads as follows:

Thread 1: orders partitions (0,1,2,3), withdrawal partitions (0,1,2)
Thread 2: orders partitions (4,5,6,7), withdrawal partitions (3,4,5)
Thread 3: orders partitions (8,9,10,11), withdrawal partitions (6,7,8)

For this to work, you would need to implement custom partitioning for both topics instead of the default one, so that all messages for a given key land in partitions assigned to the same thread.
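
One way such a custom partitioning rule could look is sketched below. This is only an illustration, not a Kafka API: `grouped_partition` is a hypothetical function, CRC32 is an arbitrary choice of stable hash, and the number of groups is assumed to divide each topic's partition count. How you plug a rule like this into a producer depends on the client library you use.

```python
import zlib

def grouped_partition(key, num_partitions, num_groups):
    """Pick a partition so that a key lands in the same group index in every topic.

    Assumes num_groups divides num_partitions (hypothetical helper, not a Kafka API).
    """
    # Stable hash: Python's built-in hash() for str is salted per process.
    h = zlib.crc32(key.encode("utf-8"))
    group_size = num_partitions // num_groups
    group = h % num_groups                    # depends only on the key, not the topic
    offset = (h // num_groups) % group_size   # spreads keys inside the group
    return group * group_size + offset

# With 12 orders partitions and 9 withdrawal partitions split into 3 groups,
# a customer's partitions in both topics belong to the same group (thread).
customers = [f"customer-{i}" for i in range(200)]
orders_groups = [grouped_partition(c, 12, 3) // 4 for c in customers]
withdrawal_groups = [grouped_partition(c, 9, 3) // 3 for c in customers]
```

Because the partition depends only on the key, all messages of one customer within a topic still go to a single partition, so per-customer ordering inside each topic is preserved.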

However, if one topic has 1 partition and the other 2, then HCF(1,2) is 1, meaning you can only do this in a single-threaded way.
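
The thread assignment described above can be computed mechanically. The sketch below is a hypothetical helper (`assign_partitions` is not part of any Kafka client) that splits each topic's partitions into contiguous groups, one per thread. It only produces the plan, and it assumes that the producers already route each key to the matching group in every topic.

```python
from math import gcd

def assign_partitions(partition_counts):
    """Split each topic's partitions into HCF-many contiguous groups, one per thread."""
    threads = 0
    for count in partition_counts.values():
        threads = gcd(threads, count)  # gcd(0, n) == n, so the first topic seeds the value
    assignment = []
    for t in range(threads):
        group = {}
        for topic, count in partition_counts.items():
            size = count // threads
            group[topic] = list(range(t * size, (t + 1) * size))
        assignment.append(group)
    return assignment

plan = assign_partitions({"orders": 12, "withdrawal": 9})
single = assign_partitions({"orders": 2, "withdrawal": 1})
```

For 12 and 9 partitions, `plan` holds three groups matching the thread table above; for 2 and 1 partitions, `single` holds one group containing every partition, which is the single-threaded case.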
