Reliable Webhook dispatching system

I am having a hard time figuring out a reliable and scalable solution for a webhook dispatch system.

The current system uses RabbitMQ with a queue for webhooks (let's call it events), which are consumed and dispatched. This system worked for some time, but now there are a few problems:

  • If a system user generates too many events, they fill up the queue, causing other users to not receive webhooks for a long time
  • If I split all events into multiple queues (by URL hash), it reduces the likelihood of the first problem, but it still happens from time to time when a very busy user hashes to the same queue
  • If I try to put each URL into its own queue, the challenge is to dynamically create/assign consumers to those queues. As far as the RabbitMQ documentation goes, the API is very limited in filtering for non-empty queues or for queues that have no consumers assigned.
  • As far as Kafka goes, as I understand from reading everything about it, the situation will be the same within the scope of a single partition.

So, the question is - is there a better way/system for this purpose? Maybe I am missing a very simple solution that would prevent one user from interfering with another?

Thanks in advance!

You may experiment with several RabbitMQ features to mitigate your issue (without removing it completely):

  • Use a public random exchange to split events across several queues. It will mitigate large spikes of events and dispatch work to several consumers.

  • Set some TTL policies on your queues. This way, RabbitMQ may republish events to another group of queues (through another private random exchange, for example) if they are not processed fast enough.

You may have several "cycles" of events with varying configuration (i.e. the number of cycles and the TTL value for each cycle). Your first cycle handles fresh events the best it can, mitigating spikes through several queues under a random exchange. If it fails to handle events fast enough, events are moved to another cycle with dedicated queues and consumers.
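A minimal sketch of the two-cycle setup, assuming the rabbitmq-random-exchange plugin is enabled (it provides the x-random exchange type) and using per-queue TTL plus dead-lettering to move stale events to the next cycle; the exchange and queue names here are made up for illustration:

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
ch = conn.channel()

# Cycle 1: a random exchange spreads fresh events over several queues.
ch.exchange_declare(exchange='events.cycle1', exchange_type='x-random')
# Cycle 2: receives events that cycle 1 failed to process in time.
ch.exchange_declare(exchange='events.cycle2', exchange_type='x-random')

for i in range(4):
    q = f'events.cycle1.q{i}'
    ch.queue_declare(queue=q, arguments={
        'x-message-ttl': 5000,                     # unprocessed after 5s ...
        'x-dead-letter-exchange': 'events.cycle2'  # ... moves to cycle 2
    })
    ch.queue_bind(queue=q, exchange='events.cycle1')

for i in range(2):
    q = f'events.cycle2.q{i}'
    ch.queue_declare(queue=q)  # no TTL: the last cycle keeps events until consumed
    ch.queue_bind(queue=q, exchange='events.cycle2')

# Fresh events always enter the first cycle, never behind old backlog.
ch.basic_publish(exchange='events.cycle1', routing_key='',
                 body=b'{"url": "https://example.com/hook", "payload": {}}')
```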

This way, you can ensure that fresh events have a better chance of being handled quickly, as they will always be published into the first cycle (and not behind a pile of old events from another user).

If you need ordering, unfortunately you depend on user input.

But in the Kafka world, there are a few things to mention here:

  • You can achieve exactly-once delivery with Transactions, which allows you to build a system similar to regular AMQP brokers.
  • Kafka supports partitioning by key, which allows you to preserve the processing order of events with the same key (in your case, userId) - see the producer sketch after this list.
  • Throughput can be increased by tuning the producer, broker, and consumer sides (batch size, in-flight requests, etc.; see the Kafka documentation for more parameters).
  • Kafka supports message compression, which reduces network traffic and increases throughput (it just consumes a little more CPU power for fast compression algorithms like LZ4).
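Here is what keyed partitioning looks like with the confluent-kafka Python client; the topic name webhook-events and the user id are made-up examples, and enable.idempotence is shown only as the building block that Kafka's transactional/exactly-once features rest on:

```python
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'compression.type': 'lz4',       # message compression, as mentioned above
    'enable.idempotence': True,      # building block for exactly-once/transactions
})

user_id = 'user-42'                               # hypothetical key
payload = b'{"url": "https://example.com/hook"}'  # hypothetical event

# Same key -> same partition -> per-user ordering; events from different
# users are spread across partitions instead of one shared queue.
producer.produce('webhook-events', key=user_id, value=payload)
producer.flush()
```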

Partitions are the most important thing in your scenario. You can increase partitions to process more messages at the same time. You can have as many consumers as partitions within the same consumer group. If you scale beyond the partition count, the extra consumers won't be able to read and will stay unassigned.

Unlike regular AMQP services, Kafka does not remove messages after you read them; it just tracks offsets per consumer group id. This allows you to do several things at the same time, like calculating a real-time user count in a separate process.
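For illustration, a consumer sketch with the same client; the group ids are hypothetical - running a second copy of this with group.id set to realtime-stats would receive every message independently of the dispatcher group, because offsets are tracked per group:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'webhook-dispatcher',   # offsets are tracked per group.id
    'auto.offset.reset': 'earliest',
})
consumer.subscribe(['webhook-events'])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # dispatch the webhook here; the message itself stays in Kafka,
    # so another group (e.g. 'realtime-stats') can read it independently
    print(msg.key(), msg.value())
```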

So, I am not sure if this is the correct way to solve this problem, but this is what I came up with.

Prerequisites: RabbitMQ with deduplication plugin

So my solution involves:

  • g:events queue - let's call it the parent queue. This queue contains the names of all child queues that need to be processed. It could probably be replaced with some other mechanism (like a Redis sorted set or something), but then you would have to implement the ack logic yourself.
  • g:events:<url> - these are the child queues. Each queue contains only the events that need to be sent out to that url.

When posting a webhook payload to RabbitMQ, you post the actual data to the child queue, and then additionally post the name of the child queue to the parent queue. The deduplication plugin won't allow the same child queue name to be posted twice, meaning that only a single consumer may receive that child queue for processing.
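A minimal sketch of the publish side with pika, assuming the message deduplication plugin's queue-level mode (the x-message-deduplication queue argument and the x-deduplication-header message header); queue names follow the g:events convention above:

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
ch = conn.channel()

# Parent queue deduplicates on the x-deduplication-header of each message.
ch.queue_declare(queue='g:events', arguments={'x-message-deduplication': True})

def publish_webhook(url, payload):
    child = f'g:events:{url}'
    ch.queue_declare(queue=child)
    # 1) the actual event data goes to the child queue
    ch.basic_publish(exchange='', routing_key=child, body=payload)
    # 2) the child queue name goes to the parent queue, deduplicated on the URL,
    #    so the same child queue is never announced twice
    ch.basic_publish(
        exchange='', routing_key='g:events', body=child.encode(),
        properties=pika.BasicProperties(headers={'x-deduplication-header': url}))

publish_webhook('https://example.com/hook', b'{"event": "order.created"}')
```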

All your consumers consume the parent queue, and after receiving a message, they start consuming the child queue specified in it. After the child queue is empty, you acknowledge the parent message and move on.

This method allows for very fine control over which child queues are allowed to be processed. If some child queue is taking too much time, just ack the parent message and republish the same data to the end of the parent queue.
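A sketch of that consumer loop under the same assumptions; the 30-second budget and the send_webhook helper are illustrative, not part of the original design:

```python
import time
import pika

def send_webhook(url, body):
    pass  # the HTTP POST to the destination url would go here

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
ch = conn.channel()

def handle_parent(channel, method, properties, body):
    child = body.decode()              # e.g. 'g:events:https://example.com/hook'
    url = child[len('g:events:'):]
    deadline = time.monotonic() + 30   # per-child time budget
    while time.monotonic() < deadline:
        m, props, payload = channel.basic_get(queue=child)
        if m is None:                  # child queue is empty: we are done with it
            break
        send_webhook(url, payload)
        channel.basic_ack(m.delivery_tag)
    else:
        # budget exceeded: push the child queue name to the end of the parent
        # queue so another round (or consumer) picks it up again
        channel.basic_publish(
            exchange='', routing_key='g:events', body=body,
            properties=pika.BasicProperties(headers={'x-deduplication-header': url}))
    channel.basic_ack(method.delivery_tag)  # ack the parent message either way

ch.basic_consume(queue='g:events', on_message_callback=handle_parent)
ch.start_consuming()
```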

I understand that this is probably not the most efficient way (there's also a bit of overhead from constantly posting to the parent queue), but it is what it is.
