Fault Tolerance 4.0

SmallRye Fault Tolerance is our implementation of Eclipse MicroProfile Fault Tolerance. It was originally based on Hystrix, the Netflix library for latency and fault tolerance in distributed systems. Lately, we realized that Hystrix is not the best fit for implementing MicroProfile Fault Tolerance, because Hystrix meshes all fault tolerance concerns into one, while MicroProfile Fault Tolerance describes mostly a layered architecture. The MicroProfile Fault Tolerance specification also requires certain features that Hystrix intentionally doesn’t provide; most importantly, the ability to interrupt threads that Hystrix itself didn’t create. Last but not least, Hystrix is in maintenance mode, and hasn’t been actively developed for more than a year. When we tried to update to latest Hystrix version, we even faced breaking changes.

We have been talking on and off about removing Hystrix and using our own implementation of core fault tolerance strategies, one that would be better suited as a base for implementing MicroProfile Fault Tolerance, but only a few months ago did we finally start working on it. Thanks to heroic work of Michał Szynkiewicz, the new implementation didn’t take long to pass the entire MicroProfile Fault Tolerance TCK. We have already released a first version of this new implementation, called 4.0.0. Of course, that’s not the end; in fact, it’s just a beginning. We should now have a better foundation to experiment with, and already have several ideas that we would like to explore and eventually propose to MicroProfile Fault Tolerance.

In this article, we’ll take a brief look at how the new implementation works inside, what do you need to know as an integrator, and what are our plans for the future.

Under the hood

First, let’s take a brief look at the core. It is based on the idea of FaultToleranceStrategy, which is an interface that looks like this:

interface FaultToleranceStrategy<V> {
    V apply(InvocationContext<V> ctx) throws Exception;
}

The InvocationContext is a Callable that represents the method invocation guarded by this fault tolerance strategy. The fault tolerance strategy does its work around ctx.call(). It can catch exceptions, invoke ctx.call() multiple times, invoke something else, etc. As an example, let’s consider this strategy, applicable to methods that return a String:

public class MyStringFallback implements FaultToleranceStrategy<String> {
    @Override
    public String apply(InvocationContext<String> ctx) {
        try {
            return ctx.call();
        } catch (Exception ignored) {
            return "my string value";
        }
    }
}

This is a very simple fallback mechanism, which returns a pre-defined value in case of an exception.

In the SmallRye Fault Tolerance codebase, you can find implementations of all the strategies required by MicroProfile Fault Tolerance: retry, fallback, timeout, circuit breaker or bulkhead. Asynchronous invocation, delegated to a thread pool, is of course also supported.

When multiple fault tolerance strategies are supposed to be used to guard one method, they form a chain. Continuing with our simple example, adding the ability to chain with another strategy would look like this:

public class MyStringFallback implements FaultToleranceStrategy<String> {
    private final FaultToleranceStrategy<String> delegate;

    public MyStringFallback(FaultToleranceStrategy<String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public String apply(InvocationContext<String> ctx) {
        try {
            return delegate.apply(ctx);
        } catch (Exception ignored) {
            return "my string value";
        }
    }
}

We see that one strategy delegates to another, passing the InvocationContext along. In fact, all the implementations in SmallRye Fault Tolerance are written like this: they expect to be used in a chain, so they take another FaultToleranceStrategy to which they delegate. But if all strategies have this form, when is ctx.call() actually invoked? Good question! The ultimate ctx.call() invocation is done by a special fault tolerance strategy which is called, well, Invocation.

As an example which uses real MicroProfile Fault Tolerance annotations, let’s consider this method:

@Retry(...)
@Timeout(...)
@Fallback(...)
public void doSomething() {
    ...
}

The chain of fault tolerance strategies will look roughly like this:

Fallback(
    Retry(
        Timeout(
            Invocation(
                // ctx.call() will happen here
                // that will, in turn, invoke doSomething()
            )
        )
    )
)

The order in which the strategies are chained (or, in fact, nested) is specified by MicroProfile Fault Tolerance.

In asynchronous invocations, especially those using CompletionStage, it is a bit more complex, but the core idea is the same and the intuition you gained here should still apply.

Please note that everything explained here is an implementation detail. API stability for these interfaces and classes is not guaranteed; we only mention them here to give you better understanding of the new implementation. You will also want to understand this if you decide to contribute (pull requests are always welcome!).

Integration concerns

If you integrate SmallRye Fault Tolerance to provide an implementation of MicroProfile Fault Tolerance, you should be aware of a few things.

SmallRye Fault Tolerance creates several thread pools (ExecutorServices). One for asynchronous invocations, one for watching timeouts, and then one for each threadpool-style bulkhead. You might want to customize how these thread pools are created. For example, if you integrate SmallRye Fault Tolerance into a Java EE application server, you probably want these thread pools to be managed. We provide an SPI that you can implement for this exact purpose: ExecutorFactory.

Size of these thread pools can be configured using the following configuration properties:

io.smallrye.faulttolerance.globalThreadPoolSize

Size of the thread pool for asynchronous invocations. Does not include bulkhead thread pools. Defaults to 100.

io.smallrye.faulttolerance.timeoutExecutorThreads

Size of the thread pool for watching timeouts. Defaults to 5.

Optional integration with MicroProfile Context Propagation is present, in a separate artifact. Optional integration with MicroProfile OpenTracing is also present, but it’s currently built on top of the Context Propagation integration. We’re looking to potentially remove that dependency, so that OpenTracing integration is also possible if you’re not yet ready to incorporate Context Propagation.

Future outlook

As we mentioned above, this is not an end. Here’s a short list of what we want to look at.

First of all, the 4.0 version implements MicroProfile Fault Tolerance 2.0. We will of course continue implementing subsequent specification versions, starting with 2.1, which should be released soon.

Currently, MicroProfile Fault Tolerance specifies that @Asynchronous invocation always means delegating to a thread pool. We’d like to investigate what would it take to support asynchronous invocation on an event loop, such as one provided by Vert.x.

We’d like to add a fail-fast strategy, tentatively called @FailFast. This is useful to prevent expensive calls when you know it doesn’t make sense to do them. One instance of this idea is the circuit breaker, which decides automatically based on rate of errors. Here, the decision would be entirely yours.

We’d also like to add adaptive concurrency limiters, tentatively called @AdaptiveBulkhead. Bulkheads, as they exist in MicroProfile Fault Tolerance now, are defined statically. That is, you need to know upfront what is the maximum concurrency level. This is no longer enough in the cloud world, where services are scaled up and down dynamically. It is possible to determine the concurrency limits dynamically, by observing the invocation latencies and error rates. Netflix has been doing that in their concurrency limits project, and we can certainly take a lot of inspiration from them.

Some of the items above are about providing a new API. The idea is that we would prototype that API in SmallRye and if it proves worthy, we’d work to specify it in MicroProfile. For that reason, we would probably mark these prototypes as experimental API, and if they get specified in MicroProfile, we’d rapidly deprecate and remove them.

There are also some implementation aspects that we’d like to finetune.

The metrics collection and configuration handling code mostly comes from the old implementation, and is due for serious refactoring.

As mentioned above, each threadpool-style bulkhead currently gets a fresh thread pool. This is quite inefficient, and can be significantly improved.

The current implementation includes a special strategy called Tracer, which can be put in between any two other strategies to print useful information about when and on which thread is the subsequent strategy invoked. This proved very useful in debugging, but using it requires manual changes and rebuild of the entire codebase. We want to add comprehensive debug and trace logging that is always present, so that you can easily enable it.

And there’s a lot more, for sure. As we said above, this is just a beginning!