Failures happen. It's a simple fact of life. In the context of our systems, this might be anything from network outages to drive failures or even simple errors in your application logic. The key is that we have to assume these events will occur. If we don't, we are guilty of simply ignoring reality.
When we think about failures, whether they are caused by exceptions, environmental factors or whatever, the most likely way we have learned to think about them is in terms of containment. For example, take the typical try/catch block that seems to be ubiquitous in nearly every popular language right now in whatever form it takes. This is all about containing these errors and failures and making sure they don't cause the rest of our system to come crashing down.
But there's a big issue that you have perhaps encountered. When these failures occur, we need to understand what state our system is in at the point the error happened and what we need to do to return it to a known-good state. That's not always an easy question to answer, particularly when we consider how exceptions affect the control flow of an application.
It's worthwhile to spend some time reviewing the mechanisms Scala has built in for handling failure cases before stepping back and looking at what Akka brings to the mix. Some of this will perhaps be familiar, given that these same techniques exist in many other languages. But Scala brings a couple of nice additions to the mix.
The first of these is exception handling using try/catch/finally. Here's a simple example to make sure the concept is clear:
import java.io.{FileWriter, IOException}

val writer = new FileWriter("test.out") // created outside the try so it is in scope in finally
try {
  writer.write(bigBlockOfData)
} catch {
  case e: IOException =>
    println("Failed to write data.")
} finally {
  writer.close()
}
In this example, we see a case statement for an exception type that we know might be encountered and a finally block that will be executed whether the catch block is executed or not.
Another recent addition to the arsenal is Try[T], which allows for executing code that might be expected to result in an exception being thrown. The classes Success[T] and Failure[T], both of which extend Try, are the concrete instances which our code will receive depending on the success of the code passed to it. In the case of an exception, the Failure instance returned will contain the exception that was thrown. Similarly, if no exception was thrown, the Success instance will contain the final result of the code that was evaluated.
One of the benefits of Try is that it includes the map and flatMap operations, so we can use it in for comprehensions, chaining together operations and proceeding only on a successful execution. In comparison to deeply nested try/catch blocks, it should be apparent how much simpler this can make the code. Another reason to consider Try is that it encodes a possibly failing operation in the type system rather than having to make the choice to explicitly handle exceptions wherever they might occur or to delegate them up the call stack.
As a basic example of using Try, here we are retrieving the HTML source for Google's home page (the final output is truncated, of course):
scala> import scala.util.Try; import scala.io.Source
import scala.util.Try
import scala.io.Source

scala> for (g <- Try{Source.fromURL("http://google.com")})
     |   yield g.getLines.foreach(println)
<!doctype html><html...
Failures can also be handled using a simple match on the result:
scala> import scala.util.{Failure, Success}
import scala.util.{Failure, Success}

scala> Try { Source.fromURL("http://imaginary.domain") } match {
     |   case Success(result) => //...
     |   case Failure(error) => println("Failed: " + error)
     | }
Failed: java.net.UnknownHostException: imaginary.domain
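To make the chaining behavior described earlier concrete, here is a minimal sketch of composing two Try-returning steps in a for comprehension. The parsePort and validatePort helpers (and the fallback port) are hypothetical names of our own, not part of any library:

import scala.util.{Failure, Success, Try}

def parsePort(raw: String): Try[Int] = Try(raw.trim.toInt)

def validatePort(port: Int): Try[Int] =
  if (port > 0 && port < 65536) Success(port)
  else Failure(new IllegalArgumentException("port out of range: " + port))

// The comprehension only proceeds while every step succeeds;
// a single getOrElse at the end supplies a fallback on any failure.
val port: Int = (for {
  parsed    <- parsePort(" 8080 ")
  validated <- validatePort(parsed)
} yield validated).getOrElse(8080)

Compare this with the equivalent nested try/catch blocks: each additional step would add another level of nesting, whereas here each step is just one more line in the comprehension.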
Scala also offers Either[A,B], which addresses a similar need to Try[T], but there are important differences between the two.
An obvious difference is that Either itself does not specify which side represents success and which represents failure; it simply provides two subtypes, Left and Right, and leaves it to convention, which is to use Left for the failure case and Right for the success case. This is in sharp contrast to Success[T] and Failure[T], the subtypes of Try[T], whose names carry the success and failure semantics directly.
A second difference is that with Either you must specify the concrete types for both the success and the failure cases, whereas Try only requires the success type, since a failure is always some kind of Throwable. This need to be explicit shows up in how Either is used: when composing Eithers, you must make an explicit choice of projection via Either's left and right methods. That becomes awkward when building and chaining computations, especially asynchronous ones, where failures can arise for many reasons and we often want to register callbacks to run on success or failure (as with Future[T] and Promise[T]). Because Try is the more general abstraction (and reads better), it is our preferred choice. Readers interested in more details should consult the documentation.
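To illustrate the difference in composition, here is a small sketch contrasting the two. The findUser and findUserTry functions are hypothetical; note that with the unbiased Either of this era we must pick the right projection before mapping, whereas Try maps over the success case directly:

import scala.util.{Failure, Success, Try}

// Left carries the failure, Right carries the success, by convention only.
def findUser(id: Int): Either[String, String] =
  if (id == 42) Right("Deep Thought") else Left("no user with id " + id)

// With Either, composition requires choosing a projection explicitly.
val greeting: Either[String, String] =
  findUser(42).right.map(name => "Hello, " + name)

// With Try, success is implicit and the failure is always a Throwable.
def findUserTry(id: Int): Try[String] =
  if (id == 42) Success("Deep Thought")
  else Failure(new NoSuchElementException("no user with id " + id))

val greetingTry: Try[String] = findUserTry(42).map(name => "Hello, " + name)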
The Akka engineering team often uses the catch-phrase let it crash (they actually borrowed it from the Erlang community, where it's sometimes phrased as let it fail). The idea is that failures should be accepted and handled appropriately. But how? Understanding the answer to this question is, in many ways, fundamental to truly understanding how to design with actors — at least, if the intent is to have the system be resilient to failures.
The primary idea to keep in mind is to isolate tasks that manipulate important data from those tasks that don't. This is easy to do with actors.
For instance, consider an actor that needs to maintain some local representation of a value, perhaps the current trading price of a commodity. This actor needs to periodically request updates to the value from a remote service. Calling the remote service is risky. If the service provides a RESTful mechanism for retrieving the current value, the actor might have to deal with any of a number of possible error conditions: connection timeouts, HTTP failure code responses, improper data encodings, expired access credentials, and so on.
The right way to approach this is to separate the remote request-making portion of the task into a new actor. Depending upon the overall structure of the system, this actor might be instantiated and managed by the data-caching actor, or it might be handled by some other actor that has more general responsibilities for these kinds of remote requests (for example, it might also handle encoding and decoding of the data formats involved). Either way, the request-making actor is a child of some other actor, and the parent of that child is also its supervisor.
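Here is a minimal sketch of that split, under our own assumptions: the message and actor names (RefreshPrice, PriceUpdated, PriceCache, PriceFetcher) are hypothetical, and fetchFromRemote stands in for the real REST call:

import akka.actor.{Actor, Props}

case object RefreshPrice
case class PriceUpdated(price: BigDecimal)

// The child performs the risky remote call; if it throws, the failure is
// handled by the parent's supervisor strategy rather than corrupting the cache.
class PriceFetcher(symbol: String) extends Actor {
  def receive = {
    case RefreshPrice => sender ! PriceUpdated(fetchFromRemote(symbol))
  }

  def fetchFromRemote(sym: String): BigDecimal = {
    // placeholder for the real HTTP request, which may time out or fail
    BigDecimal("101.25")
  }
}

// The parent holds the important state and never touches the network itself.
class PriceCache(symbol: String) extends Actor {
  val fetcher = context.actorOf(Props(classOf[PriceFetcher], symbol), "fetcher")
  var lastKnownPrice: Option[BigDecimal] = None

  def receive = {
    case RefreshPrice        => fetcher ! RefreshPrice
    case PriceUpdated(price) => lastKnownPrice = Some(price)
  }
}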
Another important idea is that of failing fast. That is, if a failure occurs, it's usually best to make sure the failure is recognized and acted on immediately, but also to allow the actors to fail when they encounter problems. Following the principle of keeping actors small and single purpose (sometimes referred to as the single-responsibility principle), it makes sense to not include a huge amount of logic around handling failures, but rather to let the actor fail and then either restart or recreate it and try again. This is where the topic of supervision comes into play.
Supervisors are a key concept to understand and master in Akka. A poor understanding of them will almost certainly lead to unexpected behavior in the actor system. This might result in data disappearing that was expected to appear, requests to remote systems that shouldn't have occurred, or any of a number of other oddities.
Every actor in the system has at the very least a parent actor. At the topmost level sit the special guardian actors mentioned earlier in the book (more details on these shortly). Below them, we create one or more top-level actors, which in any non-trivial system will likely create additional actors as children. Those children may, in turn, create more actors as needed. This tree of actors forms the actor hierarchy of the system.
Each actor acts as a supervisor for its children. What this means is that any unhandled errors occurring within the children are handled by the parent according to a special structure called a SupervisorStrategy. The SupervisorStrategy for a given actor is one of two types: either a OneForOneStrategy or an AllForOneStrategy. The difference between them is what happens to the other child actors of the supervisor. With a OneForOneStrategy, only the failing actor has the strategy's response applied to it. With an AllForOneStrategy, all of the failing actor's sibling actors are affected as well. OneForOneStrategy is very often the best choice, unless the children are so closely interdependent that a failure in one requires action on all of them. For example, imagine a monitoring system in which a node-actor spawns one child per resource it monitors on a single node (CPU, RAM, disk, liveness, and so on). If the liveness child reports that the node has gone down, an AllForOneStrategy makes sense: stop all of the monitoring children together. Now extrapolate to a data center of 500 such nodes, each with its own node-actor: at that level you would probably apply a OneForOneStrategy instead, restarting only the monitoring for the node in question when it comes back up. A sketch of the node-actor's strategy follows.
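As a brief sketch of the node-monitoring scenario (NodeDownException and the actor names are ours, and the child definitions are omitted), the node-actor's strategy might look like this, using the Stop and Restart directives described next:

import akka.actor.{Actor, AllForOneStrategy}
import akka.actor.SupervisorStrategy.{Restart, Stop}
import scala.concurrent.duration._

case class NodeDownException(node: String) extends Exception(node)

class NodeActor(node: String) extends Actor {
  override val supervisorStrategy =
    AllForOneStrategy(maxNrOfRetries = 3, withinTimeRange = 60 seconds) {
      case _: NodeDownException => Stop    // node is gone: stop every monitoring child
      case _: Exception         => Restart // isolated fault: restart the siblings together
    }

  // children monitoring cpu, ram, disk, liveness, etc. would be created here
  def receive = {
    case _ =>
  }
}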
The supervisor strategy's main purpose is to take the error that occurred and translate it into an appropriate course of action, which is one of the following directives: Resume (keep the actor and its current state and continue processing messages), Restart (replace the failed actor with a fresh instance), Stop (terminate the actor permanently), or Escalate (pass the failure up to this actor's own supervisor).
No single one of these can be applied across the board to every case, so we need to determine which applies at any given time. Note that Resume should only be used when it is certain that the actor can continue without issue in its current internal state. Since restarts are such a common response, but need to be bounded to avoid cascading failures, these strategies also take two initial parameters: the number of times an actor is allowed to restart and the window of time in which that count applies. To be precise, if the maximum number of restarts is set to 5 within a 60 second window, and the actor has already restarted 5 times in that window, the actor will simply be terminated if another failure occurs.
Let's look at an example of defining a simple strategy:
import akka.actor.{Actor, OneForOneStrategy, Props}
import akka.actor.SupervisorStrategy._
import scala.concurrent.duration._

case class ExpectedHiccup(m: String) extends Exception(m)
case class RemoteSystemFault(m: String) extends Exception(m)

class ChildActor extends Actor {
  def receive = {
    case x => throw RemoteSystemFault("fault!")
  }
}

class SimpleSupervisor extends Actor {
  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 5, withinTimeRange = 60 seconds) {
      case _: ExpectedHiccup => Resume
      case _: RemoteSystemFault => Restart
    }

  def receive = {
    case msg => context.actorFor("child") ! msg // deliver the message to the child
  }

  override def preStart(): Unit = {
    context.actorOf(Props[ChildActor], "child") // start the child when the supervisor starts
  }
}
What we have here is an actor, SimpleSupervisor, with its own strategy: we override supervisorStrategy with an implementation that looks out for RemoteSystemFault and ExpectedHiccup. This actor spawns another actor, ChildActor, and any message delivered to SimpleSupervisor is passed along to the ChildActor, which in turn throws a RemoteSystemFault. What happens next is that the child actor is restarted.
It's important to point out what's not in this strategy: quite a bit. There are certainly many other possible exceptions that might occur — it's impossible to say which they might be. In these unhandled cases, those errors are escalated to this actor's supervisor.
Another important point is that, in the case where no strategy is defined, Akka will use the default strategy. This is actually defined as one of two system strategies: SupervisorStrategy.defaultStrategy and SupervisorStrategy.stoppingStrategy.
Both of these pre-defined strategies are of type OneForOneStrategy, with no limit specified for the maximum number of restarts and no time window defined. Given that, it's good to think about how failures will be handled when there is no explicitly defined supervisor strategy. This can easily result in a system spiraling out of control, given that a generic exception will cause the actor to be restarted. If that exception occurs every time the actor runs, it will spin in place with potentially disastrous results.
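When restarting is not the behavior we want, a minimal sketch (FlakyWorker is a hypothetical child of our own) shows how a parent can opt into the built-in stopping strategy instead, so that a failing child is stopped rather than endlessly restarted:

import akka.actor.{Actor, Props, SupervisorStrategy}

class FlakyWorker extends Actor {
  def receive = {
    case msg => throw new IllegalStateException("cannot handle " + msg)
  }
}

class StoppingParent extends Actor {
  // any failure in a child now results in that child being stopped
  override val supervisorStrategy = SupervisorStrategy.stoppingStrategy

  val worker = context.actorOf(Props[FlakyWorker], "worker")

  def receive = {
    case msg => worker ! msg
  }
}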
While this applies to more than just failure handling, it's worth briefly discussing the lifecycle of an actor in Akka. As we've already seen, actors typically begin life with a call to actorOf(Props[SomeActor]). Akka starts the actor when it is first created, and as with any other object in Scala, initialization code can be placed within the constructor (the body of the class definition). We can also insert code to be run immediately before the actor is actually started, but after the object has been created, using the preStart hook. A common pattern with actors is to have the actor send itself a message when it starts to let it know to initiate some process (for instance, scheduling work using the scheduler, described in Appendix B). Akka also provides the ability to perform cleanup, as necessary, using the postStop hook. It's important to note that, at this point, the actor's mailbox will be unavailable.
import akka.actor.Actor

case object Initialize

class SelfInitializingActor extends Actor {
  override def preStart {
    super.preStart // empty implementation in the base type or class
    self ! Initialize
  }

  override def postStop {
    // perform some cleanup or just log the state of things
    super.postStop // empty implementation in the base type or class
  }

  def receive = {
    case Initialize => {
      // perform necessary initialization
    }
  }
}
The most important hooks, though, when it comes to handling failures within an actor, are the ones provided for handling additional tasks when restarts occur. The preRestart and postRestart methods both get passed the Throwable instance that caused the restart. In the preRestart case, Akka also passes the message, if any, the actor was processing when the exception occurred. Note that the default postRestart implementation calls the preStart method, so if we override postRestart we will need to call preStart directly (or call super.postRestart with the same parameters) if our code depends on that hook as well, particularly if we are relying on preStart for initialization.
You will typically use these restart hooks to handle cleanup chores in failure scenarios. A good example would be when there is some interdependent resource that needs to know when the actor is available:
import akka.actor.{ Actor, ActorRef }

case class Available(ref: ActorRef)
case class Unavailable(ref: ActorRef)

class CodependentActor(dep: ActorRef) extends Actor {
  override def preStart {
    super.preStart
    dep ! Available(self)
  }

  override def preRestart(reason: Throwable,
                          message: Option[Any]) {
    // the default implementation stops and unwatches all child actors
    super.preRestart(reason, message)
    dep ! Unavailable(self)
  }

  // this overridden implementation is not really
  // needed, but it's here to show you the form
  override def postRestart(reason: Throwable) {
    preStart()
  }
}
The other key mechanism available as part of the whole actor lifecycle system is the so-called DeathWatch, which provides a means to be notified when actors fail or when a particular actor has been stopped permanently (that is, restarts don't count). In order to make use of this, an actor registers its interest in being notified by calling context.watch on a reference to the target actor. When that target actor is shut down, the DeathWatch sends a Terminated message, which includes the ActorRef for the deceased actor. It's also possible to receive multiple Terminated messages for a single actor. This mechanism is very useful when you need to have the failure of one actor trigger some other action, but be sure to use it carefully.
import akka.actor.{Actor, ActorRef, Terminated}

case class Register(ref: ActorRef)

class MorbidActor extends Actor {
  def receive = {
    case Register(ref) => context.watch(ref)
    case Terminated(ref) => // react to the watched actor having been stopped
  }
}
Understanding the actor lifecycle is an important factor for designing robust actor systems. The dependencies between the components of an actor system should be built in such a way that the lifecycle of each individual component is considered as part of the overall picture.
It's worth talking a bit more about the actor hierarchy and what, in particular, sits above the top-most actors. There are three special actors known as guardians, which are internal to Akka. The one most often referenced is the user guardian. The user guardian is responsible for handling any errors that percolate up through the actor hierarchy and are not handled by any explicit supervisors lower in the tree. It normally applies the default strategy described above, but as of Akka 2.1 that can be changed via the setting akka.actor.guardian-supervisor-strategy. To specify a different strategy, set this to the fully-qualified class name of a class that implements akka.actor.SupervisorStrategyConfigurator. Since this is rarely needed, we will leave it as an exercise for the reader, beyond the brief sketch below.
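As a hedged sketch of what that override might look like (the class and package names here are hypothetical, and we are assuming the SupervisorStrategyConfigurator trait with a single create() method):

import akka.actor.{OneForOneStrategy, SupervisorStrategy, SupervisorStrategyConfigurator}
import akka.actor.SupervisorStrategy.Stop

// Enabled via configuration, for example in application.conf:
//   akka.actor.guardian-supervisor-strategy = "com.example.StoppingGuardianStrategy"
class StoppingGuardianStrategy extends SupervisorStrategyConfigurator {
  def create(): SupervisorStrategy = OneForOneStrategy() {
    case _: Exception => Stop // stop failing top-level actors instead of restarting them
  }
}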
The other guardians to be aware of are the system guardian and the root guardian. The system guardian is responsible for certain system-level services that need to be available before any user-level actors are started and that need to still be running after the user-level actors have shut down. One example is the logging actors, which reside under the system guardian. The startup and shutdown order of the guardians provides this guarantee: root starts first, followed by system, followed by user, and shutdown happens in the inverse order.
The root guardian resides at the very top of the actor hierarchy. Its primary purpose is to handle faults that escalate from the user or system guardians and a few other special root-level actors, with the default action being to stop those children. This guardian, being an actor, still needs a supervisor and it has one: a special pseudo-actor that will stop the child (root guardian) on any failure that reaches this level. Before the root guardian is stopped, the hierarchy is walked, recursively stopping all children, after which the isTerminated flag of the ActorSystem is set to true. The Akka team refers to this supervisor as "the one who walks the bubbles of space-time" — this is a reference to the recursive nature of the supervisor hierarchy and the fact that this special supervisor needs to exist outside of the "bubble of space-time" of the ActorSystem in order to avoid the issue of infinite recursion.
Now that we've seen how Akka addresses exceptions and errors, we can cover some general principles that are good to follow. These guidelines won't fit every scenario, but they are appropriate to use as a goal and careful thought should be given when straying from this path.
We've already discussed the let it crash philosophy. But this idea is important enough to reinforce: the system should be designed to allow for failure by isolating failure-prone operations in distinct actors. This allows a lot of flexibility when dealing with and accommodating these failures. As we'll see when we cover routers, this approach also pairs well with pooled actors. When a single actor in the pool fails, we can still quickly retry the operation, using either another actor from the pool or possibly even the same actor after a restart.
One technique for isolating failures is to use what's known as the error kernel pattern. This pattern is made possible by Akka's supervisor hierarchy, which gives us the means to delegate tasks to child actors in a variety of ways, for example via routers (covered in the next chapter) and supervisor strategies.
The key idea with the error kernel pattern is that we localize the specific failure handling where it makes sense, near the failure site, while allowing critical failures that might require more broadly scoped action to escalate as necessary. A typical example is interaction with some external system, such as a database or web service. In normal operation you should expect some amount of failure when interacting with these systems, and those failures should be isolated using the error kernel pattern or similar localized supervisors.
In this pattern, you typically create a new actor to handle some unit of work that has a potential failure case that should be guarded against, while allowing other work to continue. An example of this in action follows:
import akka.actor.{Actor, OneForOneStrategy, Props}
import akka.actor.SupervisorStrategy.{Escalate, Restart}
import scala.concurrent.duration._

case class Payload()
case class CompletedWork()
case class UnableToPerformWorkException() extends Exception()

class ErrorKernelExample extends Actor {

  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 5, withinTimeRange = 15 seconds) {
      case _: UnableToPerformWorkException => Restart
      case _: Exception => Escalate
    }

  def receive = {
    case work: Payload =>
      context.actorOf(Props[Worker]) forward work
  }
}

class Worker extends Actor {
  def receive = {
    case work: Payload =>
      sender ! doSomeWork(work)
      context.stop(self)
  }

  def doSomeWork(work: Payload) = {
    // process the work and then return a success message
    CompletedWork()
  }
}
In this example, the ErrorKernelExample actor uses a custom SupervisorStrategy that will restart the child actor when it throws an UnableToPerformWorkException, up to five occurrences within a 15 second interval. The actor itself waits for the Payload message, which indicates that some work needs to be performed, creates an instance of the Worker actor, and forwards the Payload message to it. By forwarding the message, the child actor is able to reply directly to the original sender who requested the work once it has completed successfully. Structuring the work this way allows the ErrorKernelExample to remain available for further work even when an error occurs while performing the work. However, in this case we also want to give up when the failure rate is too high, hence the limits given for the strategy.
We should also consider dedicated supervisors in some cases, as an additional means of isolation. A typical approach is to have a single supervisor whose children all play a particular role in the system, since that helps developers create logical abstractions and reason about them. This is really just another form of the error kernel pattern shown above, with a slightly different approach. Instead of creating anonymous actors to isolate dangerous work, we create normal, named actors and give them dedicated parents whose sole purpose is to supervise those children. We'll see an example of this in the context of our earlier example from the second chapter.
First, we'll create a new actor to supervise our BookmarkStore actors. This is a very simplified example, but it should provide an idea of how we can approach this:
import java.util.UUID
import akka.actor.{Actor, OneForOneStrategy, Props}
import akka.actor.SupervisorStrategy.Restart
import akka.routing.RoundRobinRouter
import scala.concurrent.duration._

class BookmarkStoreGuardian(database: Database[Bookmark, UUID]) extends Actor {

  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 5, withinTimeRange = 30 seconds) {
      case _: Exception => Restart
    }

  val bookmarkStore =
    context.actorOf(Props(classOf[BookmarkStore], database).
      withRouter(RoundRobinRouter(nrOfInstances = 10)))

  def receive = {
    case msg => bookmarkStore forward msg
  }
}
This actor, implementing the error kernel pattern, simply forwards messages on to the BookmarkStore actors, which it places behind a RoundRobinRouter (we'll see more of routers in the next chapter). The overridden supervisorStrategy is very simple here. As with the example used in Chapter 2, it simply restarts the actors on any failure until it has exceeded 5 failures within 30 seconds. If those limits are exceeded, the failure will be escalated (in this case, as we'll see below, up to the top of our ActorSystem, resulting in the system shutting down).
Here's the revised form of our Bookmarker application that initializes all of this.
import org.eclipse.jetty.server.Server
import org.eclipse.jetty.servlet.{ServletHolder, ServletContextHandler}
import java.util.UUID
import akka.actor.{Props, ActorSystem}

object Bookmarker extends App {

  val system = ActorSystem("bookmarker")

  val database = Database.connect[Bookmark, UUID]("bookmarkDatabase")

  val bookmarkStoreGuardian =
    system.actorOf(Props(classOf[BookmarkStoreGuardian], database))

  val server = new Server(8080)
  val root = new ServletContextHandler(ServletContextHandler.SESSIONS)
  root.addServlet(new ServletHolder(new BookmarkServlet(system, bookmarkStoreGuardian)), "/")

  server.setHandler(root)
  server.start
  server.join
}
The only significant change here is the addition of the code that starts our BookmarkStoreGuardian. Take note of how we pass this guardian actor to our servlet instead of the previous routed pool of BookmarkStore actors. Since the guardian is simply forwarding messages down to the underlying actors, this works without having to change our servlet code at all.
Akka gives us powerful techniques for fault handling, but it also requires designing with failure in mind. But that's something we should be doing anyway. In the next chapter, we'll begin looking at additional structures for handling the flow of messages and allocation of work in our system using routers and dispatchers.