Samza: Delay processing of messages until timestamp
I'm processing messages from a Kafka topic with Samza. Some of the messages carry a timestamp in the future, and I'd like to postpone processing them until after that timestamp. In the meantime, I'd like to keep processing other incoming messages.
What I tried is to have my Task queue the messages and implement WindowableTask to periodically check whether each queued message's timestamp now allows it to be processed. The basic idea looks like this:
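import java.util.HashSet;
import java.util.Iterator;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

public class MyTask implements StreamTask, WindowableTask {
    // Messages whose timestamp is still in the future (held in memory only)
    private final HashSet<MyMessage> waitingMessages = new HashSet<>();

    @Override
    public void process(IncomingMessageEnvelope incomingMessageEnvelope, MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        byte[] message = (byte[]) incomingMessageEnvelope.getMessage();
        MyMessage parsedMessage = MyMessage.parseFrom(message);
        if (parsedMessage.getValidFromDateTime().isBeforeNow()) {
            // Do the processing
        } else {
            waitingMessages.add(parsedMessage);
        }
    }

    @Override
    public void window(MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        // Use an explicit iterator so entries can be removed while iterating
        Iterator<MyMessage> it = waitingMessages.iterator();
        while (it.hasNext()) {
            MyMessage message = it.next();
            if (message.getValidFromDateTime().isBeforeNow()) {
                // Do the processing, then remove the message from the set
                it.remove();
            }
        }
    }
}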
This obviously has some downsides: I'd lose the waiting messages held in memory whenever I redeploy my task. So I'd like to know the best practice for delaying the processing of messages with Samza. Do I need to re-emit the messages to the same topic again and again until I can finally process them? We're talking about delays of a few minutes up to 1-2 hours here.
An important thing to keep in mind when dealing with message queues is that they perform a very specific function in a system: they hold messages while the processor(s) are busy processing preceding messages. A properly functioning message queue is expected to deliver messages on demand. This implies that as soon as a message reaches the head of the queue, the next pull on the queue will yield that message.
Notice that delay is not a configurable part of the equation. Instead, delay is an output variable of a system with a queue. In fact, Little's Law offers some interesting insights into this.
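For reference, Little's Law relates the three quantities of a stable queueing system, which makes explicit that the delay follows from the other two rather than being set directly:

$$L = \lambda W \quad\Longrightarrow\quad W = \frac{L}{\lambda}$$

Here $L$ is the average number of messages in the system, $\lambda$ the average arrival rate, and $W$ the average time a message spends in the system. As an illustration with made-up numbers, a queue holding 6,000 messages on average while 100 messages arrive per second implies an average wait of 60 seconds.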
So, in a system where a delay is necessary (for example, to join or wait for a parallel operation to complete), you should be looking at other methods. Typically, a queryable database would make sense in this particular instance. If you find yourself keeping messages in a queue for a preset period of time, you're actually using the message queue as a database, a function it was not designed to provide. Not only is this risky, but it is also likely to hurt the performance of your message broker.
I think you could use Samza's key-value store to keep the state of your task instance instead of an in-memory Set. It should look something like this:
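import java.util.ArrayList;
import java.util.List;
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.Entry;
import org.apache.samza.storage.kv.KeyValueIterator;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

public class MyTask implements StreamTask, WindowableTask, InitableTask {
    private KeyValueStore<String, MyMessage> waitingMessages;
    @SuppressWarnings("unchecked")
    @Override
    public void init(Config config, TaskContext context) throws Exception {
        this.waitingMessages = (KeyValueStore<String, MyMessage>) context.getStore("messages-store");
    }
    @Override
    public void process(IncomingMessageEnvelope incomingMessageEnvelope, MessageCollector messageCollector,
            TaskCoordinator taskCoordinator) {
        byte[] message = (byte[]) incomingMessageEnvelope.getMessage();
        MyMessage parsedMessage = MyMessage.parseFrom(message);
        if (parsedMessage.getValidFromDateTime().isBeforeNow()) {
            // Do the processing
        } else {
            waitingMessages.put(parsedMessage.getId(), parsedMessage);
        }
    }
    @Override
    public void window(MessageCollector messageCollector, TaskCoordinator taskCoordinator) {
        List<String> processedIds = new ArrayList<>();
        KeyValueIterator<String, MyMessage> all = waitingMessages.all();
        try {
            while (all.hasNext()) {
                Entry<String, MyMessage> entry = all.next();
                if (entry.getValue().getValidFromDateTime().isBeforeNow()) {
                    // Do the processing
                    processedIds.add(entry.getKey());
                }
            }
        } finally {
            all.close(); // store iterators must be closed after use
        }
        // Remove the processed messages from the store once the iterator is closed
        for (String id : processedIds) {
            waitingMessages.delete(id);
        }
    }
}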
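The window() method above walks the whole store on every tick. One possible refinement, sketched here under assumptions rather than taken from the answer, is to key entries by due time so that only the messages that are already due need to be visited. The class name DueTimeBuffer and the key format are hypothetical, and a TreeMap stands in for a sorted key-value store so that the sketch runs without Samza:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for a sorted key-value store whose keys sort by due time. */
final class DueTimeBuffer {
    // TreeMap keeps keys in lexicographic order, as a sorted store would.
    private final TreeMap<String, String> store = new TreeMap<>();

    /** Builds a key whose string order matches the numeric order of dueMillis. */
    static String key(long dueMillis, String id) {
        // Zero-pad to 19 digits so string order equals numeric order for non-negative longs.
        return String.format(Locale.ROOT, "%019d:%s", dueMillis, id);
    }

    void put(long dueMillis, String id, String payload) {
        store.put(key(dueMillis, id), payload);
    }

    /** Removes and returns all payloads with due time <= nowMillis, earliest first. */
    List<String> drainDue(long nowMillis) {
        List<String> due = new ArrayList<>();
        // headMap is exclusive: every key with due time <= nowMillis sorts below this bound.
        String upperBound = key(nowMillis + 1, "");
        Iterator<Map.Entry<String, String>> it = store.headMap(upperBound).entrySet().iterator();
        while (it.hasNext()) {
            due.add(it.next().getValue());
            it.remove(); // removing through the view also removes from the backing map
        }
        return due;
    }

    int size() {
        return store.size();
    }

    public static void main(String[] args) {
        DueTimeBuffer buffer = new DueTimeBuffer();
        buffer.put(200L, "b", "second");
        buffer.put(100L, "a", "first");
        buffer.put(300L, "c", "third");
        // At time 250, the entries due at 100 and 200 are drained; the one due at 300 stays.
        System.out.println(buffer.drainDue(250L));
        System.out.println(buffer.size());
    }
}
```

In a Samza task, the range(from, to) method of KeyValueStore could play the role of headMap here, assuming the configured key serde preserves this ordering in the store; that part is not verified by the sketch.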
If you redeploy your task, Samza should recreate the state of the key-value store (Samza keeps the values in a special Kafka topic associated with the store). Of course, you need to provide some extra configuration for your store (messages-store in the example above).
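As a rough sketch of what that configuration might look like in the job's properties file: the serde name my-message, the class com.example.MyMessageSerdeFactory, the changelog topic name, and the 60-second window interval are placeholders chosen for illustration, and a system named kafka is assumed to be defined elsewhere in the job config.

```properties
# Call window() periodically (interval in milliseconds; value is illustrative)
task.window.ms=60000

# Serdes for the store's keys and values (my-message and its factory class are hypothetical)
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
serializers.registry.my-message.class=com.example.MyMessageSerdeFactory

# RocksDB-backed store named messages-store, as used in context.getStore(...)
stores.messages-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.messages-store.key.serde=string
stores.messages-store.msg.serde=my-message

# Changelog topic from which the store is restored after a redeploy (topic name is a placeholder)
stores.messages-store.changelog=kafka.messages-store-changelog
```

Without the changelog entry the store would only live on local disk, so the restore-on-redeploy behavior described above depends on it.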
You can read about the key-value store here (for the latest Samza version at the time of writing): https://samza.apache.org/learn/documentation/0.14/container/state-management.html