The How Of Saga Resiliency

July 10, 2023 · 8 min read

When multiple parts of your application connect via asynchronous processes such as event handlers and commands, ACID transactions are nowhere to be found. You need to make sure your asynchronous processes are resilient enough to withstand failure by making the right decisions and creating the correct implementation. During my time as a Solution Architect with AxonIQ, I have helped many customers achieve the most resilient solution for their processes and Sagas. So how do we achieve resilient systems?

Processes in Axon Framework

Axon Framework applications often consist of one or more aggregates. Whenever these interact with each other, or when one interacts with the outside world, you want that to be done asynchronously through event processors. This is because aggregates are consistent and create a lock during command handling. If you then execute lengthy processes or external calls, this locks your aggregate and can frustrate your users trying to do something.

Besides frustrating your users, executing this logic from the aggregate forces you to think about error compensation when infrastructure errors occur, for example if the secondary command cannot be executed. This riddles your domain logic with infrastructure concerns.

We therefore move these processes to stateless @EventHandler methods, or stateful @SagaEventHandler methods if we need to keep state between different events. These handlers then dispatch a command to a different aggregate, or interact with the external system and send a command back to the originating aggregate. You can find out more in the reference guide, but for now we will use this simple example for the rest of the blog:

@EventHandler
public void handleEvent(AuctionWonEvent event) {
    commandGateway.send(new DeductBalanceFromBankAccountCommand(
        event.getWinner(),
        event.getAmount()
    ));
}

This sample is the most basic and least resilient option there is. It provides at-most-once delivery semantics, as it sends the command without waiting for a response. Nothing happens if the command fails, so in that case the process is disrupted.

The best (and most expensive) resiliency

We can wait for the response in the same thread as the event processor. This will block the event processor thread, but it ensures we process the response in the same transaction as the token update to the database. If the command fails or times out, the event processor will retry the event indefinitely, or move it to the Dead-Letter Queue if you configured one. This does not apply if you are using the default LoggingErrorHandler, which only logs the exception and continues. You can see that in the following sample:

@EventHandler
public void handleEvent(AuctionWonEvent event) {
    commandGateway.sendAndWait(new DeductBalanceFromBankAccountCommand(
        event.getAccountId(),
        event.getAmount()
    ));
}

This switches the semantics to at-least-once delivery. Make sure the component receiving the command is idempotent, as you could receive the command twice, thrice, or 200 times depending on retries. It's smart to surround the sendAndWait with a try-catch statement in which you define what to do when the error occurs.
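
To make the idempotency requirement concrete, here is a minimal, framework-free sketch of a receiver that ignores duplicate deliveries. The class IdempotentBankAccount and the choice of the auction id as deduplication key are assumptions for illustration; they are not part of Axon Framework or of the samples in this blog.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical receiver that applies a deduction at most once per auction id.
class IdempotentBankAccount {
    private long balance;
    // Auction ids for which a deduction was already applied (assumed unique per deduction)
    private final Set<String> processedAuctionIds = new HashSet<>();

    IdempotentBankAccount(long initialBalance) {
        this.balance = initialBalance;
    }

    // Returns true if the deduction was applied now, false if it was a duplicate delivery
    boolean deduct(String auctionId, long amount) {
        if (!processedAuctionIds.add(auctionId)) {
            return false; // Already handled, e.g. because of a retry: change nothing
        }
        balance -= amount;
        return true;
    }

    long getBalance() {
        return balance;
    }
}
```

With this in place, a retried command for the same auction leaves the balance unchanged. In an event-sourced aggregate, you would rebuild the set of processed ids from the aggregate's events instead of keeping it only in memory.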

As already said, this will block the event processor threads. If you have many events within the same segment, this can cause a delay in the asynchronous process as they are handled sequentially.

Asynchronous resiliency

We can also adapt the first sample to include resiliency asynchronously, as the send method returns a CompletableFuture. Take a look at the following sample:

@EventHandler
public void handle(AuctionWonEvent event) {
    var command = new DeductBalanceFromBankAccountCommand(
        event.getAccountId(),
        event.getAmount()
    );
    commandGateway.send(command)
            .exceptionally(throwable -> {
                System.out.println("Exception occurred: " + throwable.getMessage());
                commandGateway.send(new AuctionBalanceActionFailedCommand(...));
                return null;
            });
}

This already makes our first sample reliable, right? Unfortunately it's still not resilient enough. Ask yourself the following questions:

What happens if the JVM goes down between dispatching the command and handling its response?
What if the AuctionBalanceActionFailedCommand also fails? And what if the command correcting that fails?
What if the command fails, but the network connection is interrupted and the response is not received?

After considering these questions, I think we can conclude that this solution is still not ideal. It's still reasonably possible for processes to get interrupted without a way to recover them manually. We can improve this a bit through deadlines.

Full resiliency

A deadline in Axon Framework is a trigger based on time. It acts like a command and can target a Saga or Aggregate, so this approach is not suitable for regular event handlers. However, this can help us cope with the possibility of the JVM going down before processing a command's response. At this point in time, it's the most reliable pattern out there. What you do is:

Define a command handler that always has an event as functional output: a success event, or a failure event (e.g. if validation fails)
Dispatch the command using the regular send method, so it's asynchronous
Schedule a deadline to trigger within the timeframe in which you reasonably expect the command to succeed (e.g. 30 seconds)
Take compensating action in the Saga when the deadline fires or the failure event arrives

This way, compensating action is always taken, no matter the cause of the failure. So how does this look? First, this is what the @CommandHandler of the destination aggregate would look like:

@CommandHandler
public void handle(DeductBalanceFromBankAccountCommand cmd) {
    if(balance < cmd.getBalance()) {
        AggregateLifecycle.apply(new BalanceDeductionFailedEvent(
            this.accountId,
            "balance not high enough")
        );
        // Stop here so a failed deduction does not also emit the success event
        return;
    }
    AggregateLifecycle.apply(new BalanceDeductedEvent(
        this.accountId,
        cmd.getBalance())
    );
}

As you can see, we don't throw exceptions so the Saga can always react to an event. This keeps your Saga progressing fast with very small actions, instead of blocking the thread for an extended amount of time awaiting a potential error.

So how do we handle this in the Saga? Let's take a look.

@SagaEventHandler(associationProperty = "auctionId")
public void handle(AuctionWonEvent event, DeadlineManager deadlineManager) {
    var command = new DeductBalanceFromBankAccountCommand(
            event.winner,
            event.amount,
            event.auctionId
    );
    // Schedule a deadline for 30 seconds from now
    Instant expiry = Instant.now().plus(30, ChronoUnit.SECONDS);
    this.deadlineId = deadlineManager.schedule(expiry, "deduction-failed");
    commandGateway.send(command);
}

@SagaEventHandler(associationProperty = "auctionId")
public void handle(BalanceDeductedEvent event, DeadlineManager deadlineManager) {
    // Success! Remove future deadline
    deadlineManager.cancelSchedule("deduction-failed", deadlineId);
    // Do more business stuff
}

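@SagaEventHandler(associationProperty = "auctionId")
public void handle(BalanceDeductionFailedEvent event, DeadlineManager deadlineManager) {
    // Cancel the deadline so compensation does not run a second time when it expires
    deadlineManager.cancelSchedule("deduction-failed", deadlineId);
    // Do something with the failure
    handleDeductionFailure(event.reason);
}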

@DeadlineHandler(deadlineName = "deduction-failed")
public void handleDeductionDeadline() {
    // Do something with the expired deadline
    handleDeductionFailure("Deadline expired");
}

private void handleDeductionFailure(String reason) {
    // Take compensating action
}
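
The Saga has three possible outcomes: the success event, the failure event, or the expired deadline. The sketch below models those transitions without any Axon types, so that compensation runs at most once regardless of which signal arrives first. DeductionProcess and its method names are hypothetical and only serve to show the state handling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, framework-free model of the Saga's three outcomes.
class DeductionProcess {
    enum State { PENDING, SUCCEEDED, COMPENSATED }

    private State state = State.PENDING;
    private final List<String> compensationReasons = new ArrayList<>();

    // Corresponds to receiving BalanceDeductedEvent
    void onBalanceDeducted() {
        if (state == State.PENDING) {
            state = State.SUCCEEDED;
        }
    }

    // Corresponds to receiving BalanceDeductionFailedEvent
    void onBalanceDeductionFailed(String reason) {
        compensate(reason);
    }

    // Corresponds to the "deduction-failed" deadline firing
    void onDeadlineExpired() {
        compensate("Deadline expired");
    }

    // Compensates only while the process is still pending, so it runs at most once
    private void compensate(String reason) {
        if (state != State.PENDING) {
            return;
        }
        state = State.COMPENSATED;
        compensationReasons.add(reason);
    }

    State getState() {
        return state;
    }

    List<String> getCompensationReasons() {
        return compensationReasons;
    }
}
```

Note that a success event arriving after the deadline has already triggered compensation is simply ignored in this sketch. A real process has to decide what to do with such a late success, for example refunding the deduction.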

It's some work, but it is the most resilient way to create your inter-aggregate processes! This is why I always emphasize that you need a good aggregate design. Don't make your aggregates too large, as that causes performance problems, but making them too small makes them very chatty. And as you probably remember from the telephone game in high school, chatty communication is very hard to get right.

Pitfalls of asynchronous commands

Keeping everything asynchronous is great. But if there is a backup down the line, the pipes will become full and eventually overflow. Axon Server can hold 10,000 commands in its queue before it starts to reject them. As commands are not stored the way events are, this can become a source of problems. By taking the responsibility of synchronisation off of the event processor threads, you put it on this queue, which is not persisted. You can overwhelm your system and potentially lose commands. This is a problem for resiliency.
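
To illustrate this failure mode, the sketch below models a bounded, in-memory queue that rejects work once it is full. It does not reflect how Axon Server implements its command queue; BoundedCommandQueue and the capacity you pass in are assumptions chosen to keep the example short.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical model of a non-persistent command queue with a fixed capacity.
class BoundedCommandQueue {
    private final BlockingQueue<String> queue;
    private int rejectedCount;

    BoundedCommandQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Returns false when the queue is full; the caller must then handle the rejection
    boolean submit(String command) {
        boolean accepted = queue.offer(command); // offer() does not block when full
        if (!accepted) {
            rejectedCount++;
        }
        return accepted;
    }

    // Simulates a consumer taking one command off the queue; returns null if it is empty
    String takeNext() {
        return queue.poll();
    }

    int getRejectedCount() {
        return rejectedCount;
    }

    int size() {
        return queue.size();
    }
}
```

If the caller ignores the false return value, the rejected command is gone. Nothing in the queue itself will bring it back after a restart either, since it only lives in memory.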

A better solution

I think we can reduce the amount of work and thinking needed to make a process resilient, and I am looking into solutions that improve this experience. I will soon release one as an extension, and I plan for it to work in the following way.

The extension will contain a gateway in which you can only dispatch thoughtful commands. By thoughtful I mean:

Define a deadline by which the command should be completed
Define an action you want to be taken upon command failure
Define a retry definition

By doing this, you define your error handling approach. It would look like this in real code:

@EventHandler
public void handleEvent(AuctionWonEvent event) {
    var command = new DeductBalanceFromBankAccountCommand(
            event.winner,
            event.amount,
            event.auctionId
    );
    resilientCommandGateway.send(
            command,
            Instant.now().plus(30, ChronoUnit.SECONDS),
            RetryDefinition.ofExponential(3, 1000)
    );
}

@ResilientErrorHandler
public void handleResiliencyError(DeductBalanceFromBankAccountCommand command, String exceptionMessage) {
    // Do something with the error here!
}
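
One plausible reading of RetryDefinition.ofExponential(3, 1000) is "retry up to three times, starting with a 1000 ms wait and doubling it each time". The parameter meanings are an assumption here, since the extension is not released yet, and ExponentialBackoff below is a hypothetical helper that only shows the arithmetic.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the extension or of Axon Framework.
class ExponentialBackoff {
    // Returns the wait before each retry: initialDelayMs first, then double the previous wait
    static List<Duration> delays(int maxRetries, long initialDelayMs) {
        List<Duration> result = new ArrayList<>();
        long delay = initialDelayMs;
        for (int i = 0; i < maxRetries; i++) {
            result.add(Duration.ofMillis(delay));
            delay *= 2;
        }
        return result;
    }
}
```

Under that assumption, ExponentialBackoff.delays(3, 1000) yields waits of 1, 2 and 4 seconds. That is 7 seconds of waiting in total, which fits within the 30-second deadline used in the sample.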

The extension will achieve this by doing the following:

Dispatching a command saves it to a table in the same database transaction as the event
After the commit, the command is sent asynchronously if fewer than a configured number of commands are already in flight. Otherwise, it will be queued
When the command expires or fails, an event is published to the event store containing the command and the failure
This special event is handled by the @ResilientErrorHandler, where you can take compensating action
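
The "send if below the in-flight limit, otherwise queue" step can be sketched as follows. This is a simplified, single-threaded model and not the extension's implementation; InFlightLimiter and its method names are hypothetical, and adding to a list stands in for the actual asynchronous send.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of limiting the number of commands that are in flight.
class InFlightLimiter {
    private final int maxInFlight;
    private int inFlight;
    private final Deque<String> waiting = new ArrayDeque<>();
    private final List<String> dispatched = new ArrayList<>();

    InFlightLimiter(int maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    // Called after the transaction that stored the command has committed
    void submit(String command) {
        if (inFlight < maxInFlight) {
            dispatch(command);
        } else {
            waiting.addLast(command); // Over the limit: wait for a free slot
        }
    }

    // Called when an in-flight command completes, fails or expires
    void onSettled() {
        if (inFlight == 0) {
            return; // Nothing was in flight
        }
        inFlight--;
        String next = waiting.pollFirst();
        if (next != null) {
            dispatch(next);
        }
    }

    private void dispatch(String command) {
        inFlight++;
        dispatched.add(command); // Stands in for the actual asynchronous send
    }

    List<String> getDispatched() {
        return dispatched;
    }

    int getWaitingCount() {
        return waiting.size();
    }
}
```

Because the waiting commands were already saved in the first step, queuing them here does not risk losing them the way an in-memory-only queue would. A real implementation would also need to be thread-safe, which this sketch is not.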

This will make your command handlers resilient without putting too much strain on your event handlers. It will also work with non-Saga event handlers, so you wouldn't even need a Saga!

Let me know what you think! I'm curious about your take on this.

Conclusion

Resilient commands are hard, but not impossible. By combining deadlines with compensating actions and good event design, we can cope with any potential failure. With my new extension, I want to remove the need to build all of this by hand, and allow you to more easily use commands in a resilient way. Stay tuned!
