Samza: Delay processing of messages until a timestamp

Question

I'm processing messages from a Kafka topic with Samza. Some of the messages carry a timestamp in the future, and I'd like to postpone processing them until after that timestamp. In the meantime, I'd like to keep processing other incoming messages.

What I tried is to have my task queue the messages and implement WindowableTask to periodically check whether their timestamps allow them to be processed. The basic idea looks like this:

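import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

public class MyTask implements StreamTask, WindowableTask {

    // Messages whose validFrom timestamp is still in the future (held in memory only)
    private final Set<MyMessage> waitingMessages = new HashSet<>();

    @Override
    public void process(IncomingMessageEnvelope incomingMessageEnvelope, MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        byte[] message = (byte[]) incomingMessageEnvelope.getMessage();
        MyMessage parsedMessage = MyMessage.parseFrom(message);

        if (parsedMessage.getValidFromDateTime().isBeforeNow()) {
            // Do the processing
        } else {
            waitingMessages.add(parsedMessage);
        }
    }

    @Override
    public void window(MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        // Use an explicit iterator so that due messages can be removed while iterating
        Iterator<MyMessage> iterator = waitingMessages.iterator();
        while (iterator.hasNext()) {
            MyMessage message = iterator.next();
            if (message.getValidFromDateTime().isBeforeNow()) {
                // Do the processing
                iterator.remove();
            }
        }
    }
}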

This obviously has some downsides: the waiting messages are held only in memory, so I'd lose them whenever I redeploy my task. So I'd like to know the best practice for delaying the processing of messages with Samza. Do I need to re-emit the messages to the same topic again and again until I can finally process them? We're talking about delays of a few minutes up to 1-2 hours here.

Answer 1

When dealing with message queues, it's important to keep in mind that they perform a very specific function in a system: they hold messages while the processor(s) are busy processing preceding messages. A properly functioning message queue is expected to deliver messages on demand. This implies that as soon as a message reaches the head of the queue, the next pull on the queue will yield that message.

Notice that delay is not a configurable part of the equation. Instead, delay is an output variable of a system with a queue. In fact, Little's Law offers some interesting insights into this.
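To make the Little's Law remark concrete: for a stable system it relates the long-run average number of items held, L, to the average arrival rate, λ, and the average time an item spends in the system, W. The numbers below are purely illustrative assumptions (100 delayed messages per second, each held for one hour) and are not taken from the question:

```latex
L = \lambda W
\qquad\Rightarrow\qquad
L = 100\ \tfrac{\text{messages}}{\text{s}} \times 3600\ \text{s} = 360\,000\ \text{messages}
```

Under those assumed figures, whatever holds the delayed messages would have to retain several hundred thousand of them on average, which is why the delay is better seen as a consequence of the system's load than as a setting on the queue.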

So, in a system where a delay is necessary (for example, to join or wait for a parallel operation to complete), you should be looking at other methods. Typically, a queryable database would make sense in this particular instance. If you find yourself keeping messages in a queue for a pre-set period of time, you're actually using the message queue as a database, a function it was not designed to provide. Not only is this risky, but it also has a high likelihood of hurting the performance of your message broker.
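As a minimal sketch of what "queryable" could mean here, the class below indexes messages by the time at which they become valid and returns only those that are due. It is an in-memory stand-in, not a real database, and every name in it (DueMessageIndex, add, pollDue, pendingCount) is hypothetical; messages are plain strings to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for a store that can be queried by due time. */
final class DueMessageIndex {

    // Messages grouped by the epoch-millisecond timestamp at which they become valid.
    private final TreeMap<Long, List<String>> byDueTime = new TreeMap<>();

    /** Registers a message that must not be processed before dueAtMillis. */
    void add(long dueAtMillis, String message) {
        byDueTime.computeIfAbsent(dueAtMillis, key -> new ArrayList<>()).add(message);
    }

    /** Removes and returns all messages due at or before nowMillis, earliest first. */
    List<String> pollDue(long nowMillis) {
        // headMap returns a live view, so clearing it also removes the entries from byDueTime.
        NavigableMap<Long, List<String>> due = byDueTime.headMap(nowMillis, true);
        List<String> result = new ArrayList<>();
        for (List<String> batch : due.values()) {
            result.addAll(batch);
        }
        due.clear();
        return result;
    }

    /** Number of messages still waiting for their due time. */
    int pendingCount() {
        int count = 0;
        for (List<String> batch : byDueTime.values()) {
            count += batch.size();
        }
        return count;
    }
}
```

For example, after add(1000L, "a") and add(5000L, "c"), a call to pollDue(2000L) returns only "a" and leaves "c" pending. The point of the ordered index is that a periodic check touches only the due entries rather than every waiting message; a durable database with an index on the timestamp column would play the same role.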

Answer 2

I think you could use Samza's key-value store to keep the state of your task instance instead of an in-memory Set. It should look something like this:

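import java.util.ArrayList;
import java.util.List;

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueIterator;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

public class MyTask implements StreamTask, WindowableTask, InitableTask {

  // Waiting messages, keyed by message id
  private KeyValueStore<String, MyMessage> waitingMessages;

  @SuppressWarnings("unchecked")
  @Override
  public void init(Config config, TaskContext context) throws Exception {
    this.waitingMessages = (KeyValueStore<String, MyMessage>) context.getStore("messages-store");
  }

  @Override
  public void process(IncomingMessageEnvelope incomingMessageEnvelope, MessageCollector messageCollector,
      TaskCoordinator taskCoordinator) {
    byte[] message = (byte[]) incomingMessageEnvelope.getMessage();
    MyMessage parsedMessage = MyMessage.parseFrom(message);

    if (parsedMessage.getValidFromDateTime().isBeforeNow()) {
      // Do the processing
    } else {
      waitingMessages.put(parsedMessage.getId(), parsedMessage);
    }
  }

  @Override
  public void window(MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
    List<String> processedIds = new ArrayList<>();
    KeyValueIterator<String, MyMessage> all = waitingMessages.all();
    try {
      while (all.hasNext()) {
        MyMessage message = all.next().getValue();
        if (message.getValidFromDateTime().isBeforeNow()) {
          // Do the processing
          processedIds.add(message.getId());
        }
      }
    } finally {
      all.close(); // store iterators must be closed to release their resources
    }
    // Remove the processed messages from the store once iteration has finished
    for (String id : processedIds) {
      waitingMessages.delete(id);
    }
  }
}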

If you redeploy your task, Samza should recreate the state of the key-value store (Samza keeps the values in a special Kafka topic associated with the store). You will of course need to provide some extra configuration for your store (messages-store in the example above).
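A configuration fragment for such a store might look roughly like the following. This is a sketch under assumptions: the Kafka system is assumed to be named kafka, the changelog topic name messages-store-changelog and the serde factory com.example.MyMessageSerdeFactory are hypothetical placeholders, and the 60-second window interval is only an example value. The changelog entry is what allows the store to be restored after a redeploy.

```properties
# Local key-value store backed by RocksDB
stores.messages-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
# Changelog stream (system.stream) used to restore the store; topic name is a placeholder
stores.messages-store.changelog=kafka.messages-store-changelog
stores.messages-store.key.serde=string
stores.messages-store.msg.serde=my-message

# Serdes referenced above; MyMessageSerdeFactory is a hypothetical class you would provide
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
serializers.registry.my-message.class=com.example.MyMessageSerdeFactory

# How often window() is invoked, in milliseconds (example value)
task.window.ms=60000
```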

You can read about the key-value store here (Samza 0.14, the latest version at the time of this answer): https://samza.apache.org/learn/documentation/0.14/container/state-management.html

Answer 1: 0, 2018-01-22 17:37:29

Answer 2: 0, accepted, 2018-01-31 14:52:06
