Best practices for handling exceptions in Akka actors

Question

I have the following task, for which I have a Java/Executors solution that works well, but I'd like to implement the same functionality in Akka and am looking for best-practice suggestions.

Problem:

Fetch and parse data from multiple URLs in parallel, block until all the data has been fetched, and return the aggregated result. Errors (IOException etc.) should be retried up to a certain number of times.

My implementation so far is pretty straightforward: a Fetcher actor knows which URLs should be fetched; it creates a bunch of Worker actors and sends them the URLs, one per message. Once a Worker is done with a particular URL, it sends a message back to the Fetcher with the result. The Fetcher keeps the state of the results; the Workers are stateless. Simplified code below.

Fetcher:

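import java.net.URL;
import akka.actor.ActorRef;
import akka.actor.UntypedActor;
import akka.routing.RoundRobinPool;

class Fetcher extends UntypedActor {
  private ActorRef worker;

  // actorSystem and urls are fields omitted from this simplified listing

  @Override
  public void onReceive(Object message) throws Exception {
    if (message instanceof FetchMessage) {
      // Pool of 4 Worker actors behind a round-robin router
      this.worker = getContext().actorOf(SpringExtension.SpringExtProvider.get(actorSystem).props("Worker")
              .withRouter(new RoundRobinPool(4)), "worker");
      for (URL u : urls) {
        this.worker.tell(new WorkUnit(u), getSelf());
      }
    } else if (message instanceof Result) {
      // accumulate results
    }
  }
}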

Worker:

import akka.actor.UntypedActor;

class Worker extends UntypedActor {

  @Override
  public void onReceive(Object message) throws Exception {
    if (message instanceof WorkUnit) {
      // fetch URL, parse etc
      // send result back to sender, naming this worker as the sender of the reply
      getSender().tell(new Result(...), getSelf());
    }
  }
}

So far so good: in the absence of exceptions everything works as expected.

But if there is, say, an IOException while fetching a URL in a Worker, Akka restarts the Worker actor and the message the Worker was processing at the time is lost. Even if I use a different SupervisorStrategy the result is the same: some of the messages are effectively 'lost'. Of course I could wrap the code inside Worker.onReceive() in a try/catch, but I feel that this goes against the Akka philosophy. I guess I could use persistent messaging, but I don't think the added complexity of message persistence is justified in this case.

Perhaps I need some way for the Fetcher to figure out that a Worker failed to fetch some of the URLs and resend the WorkUnit, or to detect that some Results have not come back for too long. What would be the best approach to handle this case?

Thanks,

Answer 1

We had a similar problem in our project and found a solution that works for us: the tasks are executed regardless of exceptions, worker failures, network failures, etc. I must admit, though, that the code eventually became a bit complicated.

Our setup is the following:

There is a WorkerControl actor that handles task management and communication with the workers.
There are a number of Worker actors that live in a different VM (potentially on different physical machines).
WorkerControl receives some data to be processed and dispatches the tasks among the workers.

More or less, we tried to follow the guidelines described here.

But we also improved the fault tolerance of the design.

In the WorkerControl we keep the following data structures:

Map<ActorPath, ActorRef> registeredWorkers // registry of workers
Deque<TaskInfo> todoList                   // tasks that have not yet been processed
Map<ActorRef, TaskInfo> assignedTasks      // tasks assigned to the workers
Map<ActorPath, ActorRef> deadWorkers       // registry of dead workers

For each task to be executed we keep a data structure:

class TaskInfo {
    private final WorkerTask task;
    private int failureCount = 0;
    private int restartCount = 1;
    private Date latestResultDelivery;
}

We handle the following possible failures:

Worker fails the task by throwing an exception (e.g. an IOException in your case)

We deliver a new Failure(caughtException) message to the worker control. Upon seeing it, the worker control increments the failureCount and puts the task at the head of the todoList queue. When a given number of failures is reached, the task is considered permanently failed and is never retried. (After that, the permanently failed tasks can be logged, disposed of, or handled in a custom way.)
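The failure bookkeeping described above could be sketched roughly as follows. This is a minimal sketch in plain Java rather than the answer's actual actor code: RetryableTask, FailureTracker and the maxFailures limit are hypothetical names introduced for illustration, and the Akka messaging around them is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the answer's TaskInfo; only the failure counter is modelled.
class RetryableTask {
    final String url;
    int failureCount = 0;

    RetryableTask(String url) {
        this.url = url;
    }
}

// Hypothetical bookkeeping a worker control could run when it receives a Failure message.
class FailureTracker {
    private final int maxFailures; // assumed limit, not specified in the answer
    final Deque<RetryableTask> todoList = new ArrayDeque<>();
    final List<RetryableTask> permanentlyFailed = new ArrayList<>();

    FailureTracker(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    // Returns true if the task was queued for a retry, false if it was given up on.
    boolean onFailure(RetryableTask task) {
        task.failureCount++;
        if (task.failureCount >= maxFailures) {
            permanentlyFailed.add(task); // log, dispose of, or handle in a custom way
            return false;
        }
        todoList.addFirst(task); // head of the queue, so the retry is picked up next
        return true;
    }
}
```

Putting the failed task at the head rather than the tail means a retry is attempted before any fresh work is started.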

Worker does not deliver any result in a given period of time (e.g. it fell into an infinite loop, there is resource contention on the worker machine, the worker mysteriously disappeared, or the task processing is taking too long)

We do two things for this:

We initialize the latestResultDelivery field of the taskInfo and store the task assignment in the assignedTasks map.
We periodically run a "health check" in the worker control that determines whether a worker has been working on a certain task for too long.

for (ActorRef busyWorker : assignedTasks.keySet()) {
    Date now = new Date();
    if (now.getTime()
            - assignedTasks.get(busyWorker).getLatestResultDeliveryTime() >= 0) {
        logger.warn("{} has failed to deliver the data processing result in time", nameOf(busyWorker));
        logger.warn("{} will be marked as dead", nameOf(busyWorker));
        getSelf().tell(new Failure(new IllegalStateException("Worker did not deliver any result in time")),
                busyWorker);
        registeredWorkers.remove(busyWorker.path());
        deadWorkers.put(busyWorker.path(), busyWorker);
    }
}
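The health check only makes sense if latestResultDelivery holds the latest acceptable delivery time, that is, a deadline set when the task is assigned. The answer does not show that step, so the following is a sketch under that assumption. DeadlineTracker and its methods are hypothetical, workers are identified by a String instead of an ActorRef, and the current time is passed in explicitly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical tracker; models only the deadline bookkeeping, not the actor messaging.
class DeadlineTracker {
    // worker id -> time (epoch millis) by which its result must have arrived
    private final Map<String, Long> deadlines = new HashMap<>();

    // Called when a task is handed to a worker: the deadline is "now + timeout".
    void assign(String workerId, long nowMillis, long timeoutMillis) {
        deadlines.put(workerId, nowMillis + timeoutMillis);
    }

    // Called when the worker delivers its result in time.
    void completed(String workerId) {
        deadlines.remove(workerId);
    }

    // The periodic health check: returns every worker whose deadline has passed.
    List<String> overdue(long nowMillis) {
        List<String> late = new ArrayList<>();
        for (Map.Entry<String, Long> entry : deadlines.entrySet()) {
            if (nowMillis - entry.getValue() >= 0) {
                late.add(entry.getKey());
            }
        }
        return late;
    }
}
```

Passing the clock value in as a parameter keeps the check deterministic, which makes it easier to test than calling new Date() inside the loop.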

Network disconnects, worker process dying

Again we do two things:

When a worker registers with the worker control, we start watching the worker actor:

registeredWorkers.put(worker.path(), worker);
context().watch(worker);

If we receive a Terminated message in the worker control, we increment the restartCount and return the task to the todoList. Again, a task that has been restarted too many times eventually becomes permanently failed and is never retried. This covers the situation where the task itself is the cause of the remote worker's death (e.g. a remote system shutdown due to OutOfMemoryError). We keep separate counters for failures and restarts so that we can fine-tune the retry strategies.
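The Terminated handling relies on being able to look up which task the dead worker was holding. A rough plain-Java sketch of that lookup is shown here, with hypothetical names (TrackedTask, TerminationHandler, maxRestarts) and String worker ids in place of ActorRef. Whether the task goes back to the head or the tail of the todoList is not stated in the answer; the tail is an assumption made here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for TaskInfo; only the restart counter is modelled.
class TrackedTask {
    final String name;
    int restartCount = 0;

    TrackedTask(String name) {
        this.name = name;
    }
}

// Hypothetical handler for the moment a watched worker is reported as terminated.
class TerminationHandler {
    private final int maxRestarts; // assumed limit, not specified in the answer
    final Map<String, TrackedTask> assignedTasks = new HashMap<>();
    final Deque<TrackedTask> todoList = new ArrayDeque<>();

    TerminationHandler(int maxRestarts) {
        this.maxRestarts = maxRestarts;
    }

    // Returns true if a task was put back on the todoList.
    boolean onTerminated(String workerId) {
        TrackedTask task = assignedTasks.remove(workerId);
        if (task == null) {
            return false; // the worker was idle, nothing to reschedule
        }
        task.restartCount++;
        if (task.restartCount > maxRestarts) {
            return false; // the task keeps killing workers: treat it as permanently failed
        }
        todoList.addLast(task); // assumption: requeue at the tail
        return true;
    }
}
```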

We also make some attempts at fault tolerance in the worker itself. For example, the worker controls the execution time of its tasks and also monitors whether it has been doing anything at all recently.

Depending on the types of failures you need to handle, you can implement a subset of the listed strategies.

Bottom line, as was mentioned in one of the comments: in order to get task rescheduling, you will need to keep some data structure in your Fetcher that maps the workers to their assigned tasks.

Answer 2

Since nobody has answered the question yet, here is what I have found so far. It seems to me that for my case the Mailbox with Explicit Acknowledgement would be a good fit. Here is what the modified code would look like.

First, define peek-dispatcher and the deployment for rssWorker in a peek-dispatcher.conf file on the classpath:

peek-dispatcher {
  mailbox-type = "akka.contrib.mailbox.PeekMailboxType"
  max-retries = 10
}

akka.actor.deployment {
  /rssFetcher/rssWorker {
    dispatcher = peek-dispatcher
    router = round-robin
    nr-of-instances = 4
  }
}

Create the ActorSystem using the above config:

ActorSystem system = ActorSystem.create("Akka", ConfigFactory.load("peek-dispatcher.conf"));

The Fetcher pretty much stays as is; only the creation of the Worker actors can be simplified, since the router is now defined in the config file:

this.worker = getContext().actorOf(SpringExtension.SpringExtProvider.get(actorSystem).props("worker"), "worker");

The Worker, on the other hand, adds an extra line at the very end of processing to acknowledge the message. In case of any error, the message is not acknowledged and stays in the mailbox to be redelivered, up to 'max-retries' times, as specified in the config:

class Worker extends UntypedActor {

  @Override
  public void onReceive(Object message) throws Exception {
    if (message instanceof WorkUnit) {
      // fetch URL, parse etc
      // send result back to sender, naming this worker as the sender of the reply
      getSender().tell(new Result(...), getSelf());
      // acknowledge message; only reached if nothing above threw
      PeekMailboxExtension.lookup().ack(getContext());
    }
  }
}

NOTE: I'm not sure that PeekMailboxExtension.lookup().ack(getContext()); is the correct way to call acknowledge, but it seems to work.

This could probably also be combined with SupervisorStrategy.resume() for the Workers: since a Worker has no state, it can simply resume consuming messages after an error. I don't think there is any need to restart the Worker.
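To make the acknowledge-or-redeliver behaviour described in this answer concrete, here is a toy model of it in plain Java. It is not the akka-contrib PeekMailbox implementation: AckMailbox and deliverNext are hypothetical names, and the assumption that a message is dropped once maxRetries attempts have been made is based only on the description above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Toy model: a message leaves the queue only when it is acknowledged
// or when it has been attempted maxRetries times.
class AckMailbox<T> {
    private final Deque<T> queue = new ArrayDeque<>();
    private final int maxRetries;
    private int triesForHead = 0;

    AckMailbox(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    void enqueue(T message) {
        queue.addLast(message);
    }

    boolean isEmpty() {
        return queue.isEmpty();
    }

    // Offers the head message to the handler once. The handler returns true to
    // acknowledge; returning false or throwing leaves the message in place.
    void deliverNext(Predicate<T> handler) {
        T head = queue.peekFirst(); // peek, do not remove yet
        if (head == null) {
            return;
        }
        triesForHead++;
        boolean acked;
        try {
            acked = handler.test(head);
        } catch (RuntimeException e) {
            acked = false; // processing failed before the acknowledgement
        }
        if (acked || triesForHead >= maxRetries) {
            queue.pollFirst(); // acknowledged, or retries exhausted
            triesForHead = 0;
        }
    }
}
```

Because the acknowledgement is the last step of processing, an exception anywhere earlier leaves the message at the head of the queue for the next attempt.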

Answer 3

To let the Fetcher know which message or task failed, you can use the actor's built-in preRestart hook.

You can look here for details: http://alvinalexander.com/scala/understand-methods-akka-actors-scala-lifecycle

According to the Akka documentation, when an actor is restarted, the old actor is informed by a call to preRestart, which receives the exception that caused the restart and the message that triggered the exception. The message may be None if the restart was not caused by processing a message.
