
Best practices for dealing with exceptions in Akka actors

I have the following task, for which I have a Java/Executors solution working well, but I'd like to implement the same functionality in Akka and am looking for best-practice suggestions.

Problem:

Fetch/parse data from multiple URLs in parallel, block until all data has been fetched, and return the aggregated result. It should retry on errors (IOException etc.) up to a certain number of times.

My implementation so far is pretty straightforward: create a Fetcher actor which knows which URLs should be fetched; it creates a bunch of Worker actors and sends them URLs, one per message. Once done with a particular URL, a Worker sends a message back to the Fetcher with the result. The Fetcher keeps the state of the results; the Workers are stateless. Simplified code below.

Fetcher:

class Fetcher extends UntypedActor {
  private ActorRef worker;

  public void onReceive(Object message) throws Exception {
    if (message instanceof FetchMessage) {
      this.worker = context().actorOf(SpringExtension.SpringExtProvider.get(actorSystem).props("Worker")
              .withRouter(new RoundRobinPool(4)), "worker");
      for (URL u : urls) {
        this.worker.tell(new WorkUnit(u), getSelf());
      }
    }
    else if (message instanceof Result) {
      // accumulate results
    }
  }
}

Worker:

class Worker extends UntypedActor {

  public void onReceive(Object message) throws Exception {
    if (message instanceof WorkUnit) {
      // fetch URL, parse etc.
      // send the result back to the sender
      getSender().tell(new Result(...), null);
    }
  }
}

So far so good and in absence of exceptions everything works as expected.

But if there is, say, an IOException while fetching a URL in the Worker, then Akka will restart the Worker actor, but the message the Worker was processing at the time is lost. Even if I use a different SupervisorStrategy, the result is the same: some of the messages are effectively 'lost'. Of course I could wrap the code inside Worker.onReceive() in a try/catch, but I feel that this goes against the Akka philosophy. I guess I could use persistent messaging, but I don't think the added complexity of message persistence is justified in this case.

I perhaps need some way for the Fetcher to figure out that a Worker failed while fetching some of the URLs and to resend the WorkUnit, or to detect that some Results have not come back for too long. What would be the best approach to handle this case?

Thanks,

We had a similar problem in our project and found a solution which works for us: the tasks are executed regardless of exceptions, worker failures, network failures, etc. Although I must admit that the code eventually became a bit complicated.

So our setup is the following:

  1. There is a WorkerControl actor that handles the task management and communication with the workers
  2. There are a number of Worker actors that live in a different VM (potentially on different physical machines)
  3. WorkerControl receives some data to be processed and dispatches the tasks between the workers

More or less we tried to follow the guidelines described here

But we also improved the failure tolerance of the design.

In the WorkerControl we keep the following data structures:

Map<ActorPath, ActorRef> registeredWorkers // registry of workers
Deque<TaskInfo> todoList                   // tasks that have not been yet processed
Map<ActorRef, TaskInfo> assignedTasks      // tasks assigned to the workers
Map<ActorPath, ActorRef> deadWorkers       // registry of dead workers

For each task to be executed we keep a data structure

class TaskInfo {
    private final WorkerTask task;
    private int failureCount = 0;
    private int restartCount = 1;
    private Date latestResultDelivery;
}

We handle the following list of possible failures

Worker fails the task by throwing an exception (e.g. IOException in your case)

We deliver a new Failure(caughtException) message to the worker control. Upon seeing it, the worker control increments the failureCount and puts the task at the head of the todoList queue. When a given number of failures is reached, the task is considered permanently failed and is never retried. (After that the permanently failed tasks can be logged, disposed of, or handled in a custom way.)
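The failure bookkeeping described above can be sketched in plain Java. This is only an illustration of the mechanism: the String task identifier stands in for the real WorkerTask, and MAX_FAILURES is an assumed threshold, not a value from the original design.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of the failure-count bookkeeping; names are illustrative.
class TaskInfo {
    final String task;      // stands in for the real WorkerTask
    int failureCount = 0;

    TaskInfo(String task) { this.task = task; }
}

class RetryBookkeeping {
    static final int MAX_FAILURES = 3; // assumed retry limit

    final Deque<TaskInfo> todoList = new ArrayDeque<>();

    /** Called when a worker reports Failure(caughtException) for its task. */
    void onTaskFailure(TaskInfo info) {
        info.failureCount++;
        if (info.failureCount < MAX_FAILURES) {
            todoList.addFirst(info); // retry soon: head of the queue
        } else {
            // permanently failed: log, dispose of, or handle in a custom way
        }
    }
}
```

The key design point is `addFirst`: a failed task jumps the queue so transient errors are retried quickly, while the counter guarantees the loop terminates.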

Worker does not deliver any result within a given period of time (e.g. it fell into an infinite loop, resource contention on the worker machine, the worker mysteriously disappeared, task processing is taking too long)

We do two things for this

  1. We initialize the latestResultDelivery field of the taskInfo and store the task assignment in the assignedTasks map.
  2. We periodically run a "health check" on the worker control that determines whether a worker has been working on a certain task for too long.
    long maxSilenceMillis = 60_000; // how long a worker may stay silent before being marked dead (example value)
    for (ActorRef busyWorker : assignedTasks.keySet()) {
        Date now = new Date();
        if (now.getTime()
                - assignedTasks.get(busyWorker).getLatestResultDeliveryTime() >= maxSilenceMillis) {
            logger.warn("{} has failed to deliver the data processing result in time", nameOf(busyWorker));
            logger.warn("{} will be marked as dead", nameOf(busyWorker));
            getSelf().tell(new Failure(new IllegalStateException("Worker did not deliver any result in time")),
                    busyWorker);
            registeredWorkers.remove(busyWorker.path());
            deadWorkers.put(busyWorker.path(), busyWorker);
        }
    }
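The periodic run of this health check can be triggered with Akka's scheduler by having the worker control send itself a recurring message. A minimal sketch, assuming a hypothetical HealthCheck marker class and an illustrative 30-second interval:

```java
import java.util.concurrent.TimeUnit;
import scala.concurrent.duration.Duration;

// In WorkerControl's preStart(): schedule a HealthCheck message to ourselves
// every 30 seconds (both the interval and the HealthCheck class are assumptions).
getContext().system().scheduler().schedule(
        Duration.create(30, TimeUnit.SECONDS), // initial delay
        Duration.create(30, TimeUnit.SECONDS), // interval
        getSelf(), new HealthCheck(),
        getContext().dispatcher(), null);
```

The HealthCheck message is then matched in onReceive like any other message, which keeps the check single-threaded within the actor and avoids locking the shared maps.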

Network disconnects, worker process dying

Again we do two things:

  1. Upon worker registration with the worker control we start watching the worker actor

     registeredWorkers.put(worker.path(), worker);
     context().watch(worker);

  2. If we receive a Terminated message in the worker control, we increment the restartCount and return the task to the todoList. Again, a task that has been restarted too many times eventually becomes permanently failed and is never retried again. This handles the situation where the task itself becomes the cause of the remote worker's death (e.g. a remote system shutdown due to an OutOfMemoryError). We keep separate counters for failures and restarts to be able to fine-tune the retry strategies.
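The Terminated-handling step can be sketched the same way in plain Java. Here String keys stand in for the ActorRef/ActorPath values from the original design, and MAX_RESTARTS is an assumed limit:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the worker-control state touched by a Terminated message;
// workers are identified by plain Strings instead of ActorRef/ActorPath.
class RestartableTask {
    final String task;
    int restartCount = 0;
    RestartableTask(String task) { this.task = task; }
}

class WorkerControlState {
    static final int MAX_RESTARTS = 2; // assumed restart limit

    final Map<String, RestartableTask> assignedTasks = new HashMap<>();
    final Deque<RestartableTask> todoList = new ArrayDeque<>();
    final Set<String> deadWorkers = new HashSet<>();

    /** Called when a Terminated message arrives for a watched worker. */
    void onWorkerTerminated(String worker) {
        deadWorkers.add(worker);
        RestartableTask orphan = assignedTasks.remove(worker);
        if (orphan == null) return; // worker was idle, nothing to reschedule
        orphan.restartCount++;
        if (orphan.restartCount <= MAX_RESTARTS) {
            todoList.addLast(orphan); // give the task another chance
        } else {
            // the task keeps killing workers: treat it as permanently failed
        }
    }
}
```

Note the task goes to the tail of the queue here (unlike the exception case), so a possibly worker-killing task does not immediately hit the next worker.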

We also make some attempts to be failure tolerant in the worker itself. E.g. the worker controls the execution time of its tasks, and also monitors whether it has been doing anything at all recently.

Depending on the types of failures you need to handle you can implement a subset of the listed strategies.

Bottom line: as was mentioned in one of the comments, in order to get task rescheduling you will need to keep a data structure in your Fetcher that maps workers to their assigned tasks.

Since nobody has answered the question yet, here is what I found so far. It seems to me that for my case the Mailbox with Explicit Acknowledgement would be a good fit. Here is how the modified code would look.

First, define the peek-dispatcher and the deployment for rssWorker in a peek-dispatcher.conf file on the classpath:

peek-dispatcher {
  mailbox-type = "akka.contrib.mailbox.PeekMailboxType"
  max-retries = 10
}

akka.actor.deployment {
  /rssFetcher/rssWorker {
    dispatcher = peek-dispatcher
    router = round-robin
    nr-of-instances = 4
  }
}

Create ActorSystem using above config:

ActorSystem system = ActorSystem.create("Akka", ConfigFactory.load("peek-dispatcher.conf"));

The Fetcher pretty much stays as is; only the creation of the Worker actors can be simplified, since we now define the router in the config file:

this.worker = getContext().actorOf(SpringExtension.SpringExtProvider.get(actorSystem).props("worker"), "worker");

The Worker, on the other hand, adds an extra line at the very end of processing to acknowledge the message. In case of any error the message won't be acknowledged and will stay in the mailbox to be redelivered, up to 'max-retries' times as specified in the config:

class Worker extends UntypedActor {

  public void onReceive(Object message) throws Exception {
    if (message instanceof WorkUnit) {
      // fetch URL, parse etc.
      // send the result back to the sender
      getSender().tell(new Result(...), null);
      // acknowledge the message so it is removed from the mailbox
      PeekMailboxExtension.lookup().ack(getContext());
    }
  }
}

NOTE: I'm not sure that PeekMailboxExtension.lookup().ack(getContext()); is the correct way to acknowledge, but it seems to work.

This could probably also be combined with SupervisorStrategy.resume() for the Workers: since a Worker has no state, it can simply resume consuming messages after an error; I don't think there is any need to restart the Worker.
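A minimal sketch of such a strategy in the Fetcher, assuming the Akka 2.3-era Java API used elsewhere in this question (the retry limit and time window passed to OneForOneStrategy are arbitrary example values):

```java
import java.io.IOException;

import akka.actor.OneForOneStrategy;
import akka.actor.SupervisorStrategy;
import akka.actor.SupervisorStrategy.Directive;
import akka.japi.Function;
import scala.concurrent.duration.Duration;

// Inside the Fetcher: resume stateless Workers on IOException so the
// PeekMailbox can redeliver the unacknowledged message instead of restarting.
private static final SupervisorStrategy STRATEGY = new OneForOneStrategy(
        10, Duration.create("1 minute"),
        new Function<Throwable, Directive>() {
            @Override
            public Directive apply(Throwable t) {
                if (t instanceof IOException) {
                    return SupervisorStrategy.resume(); // keep the Worker as-is
                }
                return SupervisorStrategy.escalate();
            }
        });

@Override
public SupervisorStrategy supervisorStrategy() {
    return STRATEGY;
}
```

With resume() the Worker keeps its mailbox and internal state and just moves on to the next message, which matches the "stateless Worker" assumption above.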

In order to give the Fetcher the ability to know which message/task failed, you can use the actor's preRestart built-in Akka hook.

You can look here for details: http://alvinalexander.com/scala/understand-methods-akka-actors-scala-lifecycle

According to the Akka documentation, when an actor is restarted, the old actor instance is informed of the process: preRestart is called with the exception that caused the restart and the message that triggered that exception. The message may be None if the restart was not caused by processing a message.
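A sketch of this hook on the Worker side, assuming the UntypedActor API from the question; WorkFailed is a hypothetical message class you would define yourself to carry the failed unit back to the parent:

```java
import scala.Option;

// Inside the Worker: before the restart happens, report the in-flight
// message to the parent (the Fetcher) so it can re-queue the WorkUnit.
// WorkFailed is a hypothetical message class, not part of Akka.
@Override
public void preRestart(Throwable reason, Option<Object> message) throws Exception {
    if (message.isDefined() && message.get() instanceof WorkUnit) {
        getContext().parent().tell(
                new WorkFailed((WorkUnit) message.get(), reason), getSelf());
    }
    super.preRestart(reason, message);
}
```

The Fetcher then handles WorkFailed in its onReceive alongside Result, either resending the WorkUnit or counting it as permanently failed, just like the bookkeeping described in the first answer.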
