
· 8 min read

When the parts of your application communicate asynchronously, through event handlers and commands, there are no ACID transactions to fall back on. You need to make your asynchronous processes resilient enough to withstand failure by making the right design decisions and implementing them correctly. During my time as a Solution Architect at AxonIQ, I have helped many customers make their processes and Sagas as resilient as possible. So how do we achieve resilient systems?

Processes in Axon Framework

Axon Framework applications often consist of one or more aggregates. Whenever these interact with each other, or with the outside world, you want that to happen asynchronously through event processors. Aggregates are kept consistent by locking them during command handling; if you execute lengthy processes or external calls inside a command handler, the aggregate stays locked and your users end up waiting.

Besides frustrating your users, executing this logic from the aggregate forces you to think about compensation when infrastructure errors occur, for example when the secondary command cannot be delivered. This riddles your domain logic with infrastructure concerns.
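To make that concrete, here is a minimal sketch of the pattern to avoid; the Invoice aggregate, SettleInvoiceCommand, and PaymentProviderClient are hypothetical and not part of the example used later in this post:

@CommandHandler
public void handle(SettleInvoiceCommand cmd, PaymentProviderClient paymentProvider) {
    // Anti-pattern: the aggregate stays locked while we wait for this external
    // call, and a failure here forces compensation logic into the domain model.
    paymentProvider.collectPayment(cmd.getInvoiceId(), cmd.getAmount());
    AggregateLifecycle.apply(new InvoiceSettledEvent(cmd.getInvoiceId()));
}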

We therefore move these processes to stateless @EventHandler methods, or to stateful @SagaEventHandler methods if we need to keep state between events. These event handlers then dispatch a command to a different aggregate, or interact with the external system and send a command back to the originating aggregate. You can find out more in the reference guide, but for now we will use this simple example for the rest of the blog:

@EventHandler
public void handleEvent(AuctionWonEvent event) {
    commandGateway.send(new DeductBalanceFromBankAccountCommand(
            event.getWinner(),
            event.getAmount()
    ));
}

This sample is the most basic and least resilient option there is. It provides at-most-once delivery semantics, as it dispatches the command without waiting for a response. There is no action if the command fails, so when that happens the process is disrupted.

The best (and most expensive) resiliency

We can instead wait for the response on the event processor's thread. This blocks the thread, but ensures we process the response in the same transaction as the token update in the database. If the command fails or times out, the event processor will retry it indefinitely (unless you are using the default LoggingErrorHandler), or put it into the dead-letter queue if you configured one. You can see that in the following sample:

@EventHandler
public void handleEvent(AuctionWonEvent event) {
    commandGateway.sendAndWait(new DeductBalanceFromBankAccountCommand(
            event.getAccountId(),
            event.getAmount()
    ));
}

This switches the semantics to at-least-once delivery. Make sure the component receiving the command is idempotent, as it could receive the command twice, thrice, or 200 times depending on retries. It's smart to surround the sendAndWait with a try-catch block in which you define what to do when an error occurs, as shown in the sketch below.
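As a minimal sketch of that try-catch, assuming a hypothetical constructor for the AuctionBalanceActionFailedCommand used for compensation later in this post:

@EventHandler
public void handleEvent(AuctionWonEvent event) {
    try {
        commandGateway.sendAndWait(new DeductBalanceFromBankAccountCommand(
                event.getAccountId(),
                event.getAmount()
        ));
    } catch (Exception exception) {
        // Decide here what to do: compensate, log, or rethrow so the
        // event processor's error handler can retry the event.
        commandGateway.send(new AuctionBalanceActionFailedCommand(event.getAuctionId()));
    }
}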

As mentioned, this blocks the event processor's threads. If you have many events within the same segment, this can delay the asynchronous process, as they are handled sequentially.
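Whether failed events are actually retried depends on the error handler of your processing group. As a minimal configuration sketch, assuming a Spring Boot application and a hypothetical processing group called "auction-processing", this is how you could replace the default LoggingErrorHandler with a propagating one so failures bubble up and the processor retries:

@Autowired
public void configureErrorHandling(EventProcessingConfigurer configurer) {
    // Rethrow exceptions from event handlers so the event processor
    // retries the event instead of merely logging the failure.
    configurer.registerListenerInvocationErrorHandler(
            "auction-processing",
            configuration -> PropagatingErrorHandler.instance()
    );
}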

Asynchronous resiliency

We can also adapt the first sample to add resiliency asynchronously, since the send method returns a CompletableFuture. Take a look at the following sample:

@EventHandler
public void handle(AuctionWonEvent event) {
    var command = new DeductBalanceFromBankAccountCommand(
            event.getAccountId(),
            event.getAmount()
    );
    commandGateway.send(command)
            .exceptionally(throwable -> {
                System.out.println("Exception occurred: " + throwable.getMessage());
                commandGateway.send(new AuctionBalanceActionFailedCommand(...));
                return null;
            });
}

This already makes our first sample reliable, right? Unfortunately it's still not resilient enough. Ask yourself the following questions:

  • What happens if the JVM goes down between dispatching the command and handling its response?
  • What if the AuctionBalanceActionFailedCommand also fails? And what if the command correcting that fails?
  • What if the command fails, but the network connection is interrupted and the response is not received?

After considering these questions, I think we can conclude that this solution is still not ideal. It's still quite possible for processes to get interrupted without even a manual way to recover them. We can improve on this with deadlines.

Full resiliency

A deadline in Axon Framework is a trigger based on time. It acts like a command and can only target a Saga or an Aggregate, so this approach is not suitable for regular event handlers. It does, however, help us cope with the possibility of the JVM going down before a command's response is processed. At this point in time, it's the most reliable pattern out there. What you do is:

  • Define a command handler which always has an event as its functional output:
    • A success event
    • A failure event (e.g. when validation fails)
  • Dispatch the command using the regular send method, so it stays asynchronous
  • Schedule a deadline to trigger within a reasonable timeframe in which you expect the command to succeed (e.g. 30 seconds)
  • Take compensating action in the Saga when either the deadline fires or the failure event arrives

This way, compensating action is always taken, no matter the cause. So how does this look? First, this is what the @CommandHandler of the destination aggregate would look like:

@CommandHandler
public void handle(DeductBalanceFromBankAccountCommand cmd) {
    if (balance < cmd.getAmount()) {
        AggregateLifecycle.apply(new BalanceDeductionFailedEvent(
                this.accountId,
                "balance not high enough"
        ));
        return;
    }
    AggregateLifecycle.apply(new BalanceDeductedEvent(
            this.accountId,
            cmd.getAmount()
    ));
}

As you can see, we don't throw exceptions, so the Saga can always react to an event. This keeps your Saga progressing quickly with very small actions, instead of blocking a thread for an extended time while waiting for a potential error.

So how do we handle this in the Saga? Let's take a look.

// Saga state: keep the schedule id so the deadline can be cancelled later
private String deadlineId;

@SagaEventHandler(associationProperty = "auctionId")
public void handle(AuctionWonEvent event, DeadlineManager deadlineManager) {
    var command = new DeductBalanceFromBankAccountCommand(
            event.winner,
            event.amount,
            event.auctionId
    );
    // Schedule a deadline for 30 seconds from now
    Instant expiry = Instant.now().plus(30, ChronoUnit.SECONDS);
    this.deadlineId = deadlineManager.schedule(expiry, "deduction-failed");
    commandGateway.send(command);
}

@SagaEventHandler(associationProperty = "auctionId")
public void handle(BalanceDeductedEvent event, DeadlineManager deadlineManager) {
    // Success! Cancel the scheduled deadline
    deadlineManager.cancelSchedule("deduction-failed", deadlineId);
    // Do more business stuff
}

@SagaEventHandler(associationProperty = "auctionId")
public void handle(BalanceDeductionFailedEvent event) {
    // Do something with the failure
    handleDeductionFailure(event.reason);
}

@DeadlineHandler(deadlineName = "deduction-failed")
public void handleDeductionDeadline() {
    // Do something with the expired deadline
    handleDeductionFailure("Deadline expired");
}

private void handleDeductionFailure(String reason) {
    // Take compensating action
}

It's some work, but it is the most resilient way to build your inter-aggregate processes! This is why I always emphasize that you need a good aggregate design. Don't make aggregates too large, as that will cause performance problems, but making them too small makes them very chatty. And as you probably remember from the telephone game in high school, that is very hard to get right.

Pitfalls of asynchronous commands

Keeping everything asynchronous is great. But if there is a backup down the line, the pipes fill up and eventually overflow. Axon Server can hold 10,000 commands in its queue before it starts to reject them. As commands are not sourced, this can become a source of problems: by taking the responsibility of synchronization off the event processor threads, you put it on a queue that is not persisted. You can overwhelm your system and potentially lose commands. This is a problem for resiliency.

A better solution

I think we can reduce the amount of work and thinking needed to make a process resilient. I am looking into better solutions to improve this experience. I will soon release one as an extension, and I plan for it to work in the following way.

The extension will contain a gateway through which you can only dispatch thoughtful commands. By thoughtful I mean that you:

  • Define a deadline by which the command should be completed
  • Define an action you want to be taken upon command failure
  • Define a retry definition

By doing this, you define your error handling approach. It would look like this in real code:

@EventHandler
public void handleEvent(AuctionWonEvent event) {
    var command = new DeductBalanceFromBankAccountCommand(
            event.winner,
            event.amount,
            event.auctionId
    );
    resilientCommandGateway.send(
            command,
            Instant.now().plus(30, ChronoUnit.SECONDS),
            RetryDefinition.ofExponential(3, 1000)
    );
}

@ResilientErrorHandler
public void handleResiliencyError(DeductBalanceFromBankAccountCommand command, String exceptionMessage) {
    // Do something with the error here!
}

The extension will achieve this by doing the following:

  • Dispatching a command saves it to a table in the same database transaction as the event (see the sketch after this list)
  • After the commit, the command is sent asynchronously if fewer than a configured number of commands are already in flight; otherwise it is queued
  • When the command expires or fails, an event containing the command and the failure is published to the event store
  • This special event is handled by the @ResilientErrorHandler, where you can take compensating action
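As a rough, purely illustrative sketch of that first step (this is not the actual extension code), the pending command could be stored in an outbox-style JPA entity, so it survives a JVM crash between the commit and the actual dispatch:

@Entity
public class PendingCommand {

    @Id
    private String id;
    // The serialized command payload, saved in the same transaction as the event
    @Lob
    private byte[] serializedCommand;
    // The moment after which the command counts as expired
    private Instant expiresAt;
    // Number of dispatch attempts made so far
    private int attempts;

    protected PendingCommand() {
        // Required by JPA
    }

    public PendingCommand(String id, byte[] serializedCommand, Instant expiresAt) {
        this.id = id;
        this.serializedCommand = serializedCommand;
        this.expiresAt = expiresAt;
    }
}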

This will make your command dispatching resilient without putting too much strain on your event handlers. It will also work with plain, non-Saga event handlers, so you might not even need a Saga!

Let me know what you think! I'm curious about your take on this.

Conclusion

Resilient commands are hard, but not impossible. By combining deadlines with compensating actions and good event design, we can cope with just about any failure. With my new extension, I want to remove most of this effort and allow you to use commands in a resilient way more easily. Stay tuned!

· 7 min read

I will be making the switch to a nomadic lifestyle soon. I have sold my house, and the new owners will get the keys on October 2nd, exactly three months away. Home will be where I put my laptop down. Since I want to still be productive, I need to make changes. My clunky home setup will not do!

· 6 min read

I have decided to switch my lifestyle and become a digital nomad. The idea of working from anywhere in the world, experiencing other cultures, and being free to go wherever you want, has a tremendous pull on me. Besides announcing that I will become one, this blog will explain why I chose this path.

· 12 min read

This blog was originally posted on the AxonIQ Developer portal

Like parents want what's best for their children, we developers want what's best for our application. We concern ourselves with many aspects of our application, like modularity, readability of code, and, of course, performance! In this blog, we'll dive into how you can tune your event processors in the best way possible.

This blog will present many optimizations, most of which come with caveats that should be kept in mind when implementing them. Each optimization's limitations are outlined in its respective section; please read them carefully before implementing it. Make optimizations one at a time and measure the results before implementing another. That way, measures that have a negative impact on a specific use case can be isolated from those that have a positive one.

· 4 min read

This blog was originally posted on the AxonIQ Developer portal

Axon Framework provides the building blocks that CQRS requires and helps developers to create scalable and extensible applications while maintaining application consistency in distributed systems. It helps by handling the communication of messages, such as commands, events, and queries.

However, your IDE does not understand this, so it leaves finding message handlers or publishers up to your memory. We are introducing the IntelliJ Axon Framework plugin to solve this problem. You can download it via the JetBrains Marketplace within IntelliJ or on the marketplace website.

· 6 min read

This blog was originally posted on blog.the-experts.nl.

A recent migration of functionality to a new service went wrong. When ships cross the exit line of the harbor, we check whether an open visit currently exists for that ship. If so, we close the visit and delete it from certain projections.

Then a ship did something I never imagined it would: it turned around right before entering the harbor, sailing back over the exit point line. When it wanted to enter again, the Harbor Control Center could not find the visit, since it had been closed.

Besides fixing the validations in our aggregate, we needed a way to restore the data. We could have replayed the entire projections. However, over the past six months we had accumulated around 2.5M events, and even though we had tuned everything very well, a replay would take an estimated two hours. The replay would also invalidate the data for all other ships, forcing the Harbor Control Center to fall back on painful methods. Not a good option.

We needed a way to replay events for a single aggregate. There was nothing on the topic to be found on the web, except questions about how to do it and some vague "I don't understand why you would do such a thing" or "It's not possible" replies.

· 9 min read

This blog was originally posted on blog.the-experts.nl.

All commands, queries, and events in Axon Framework are messages that can have metadata attached to them. Metadata is a map structure in which you can add additional data about the message that is not part of your domain model. Authentication and auditability are good examples; it’s great to know which user executed an action on your system, but it might be irrelevant to your domain. You could add the username to the command and the event, but this pollutes your domain model. In addition, it is not very DRY, since you would have to repeat this for every command and event.

Metadata can help you achieve these goals while not repeating yourself or polluting your domain model. When dispatching a command, or applying an event, you can provide a map of the metadata you want to pass along with it. This is done in the next example, where my profile is created by a user named “hansklok”.
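As a minimal sketch of what that can look like (the CreateProfileCommand, ProfileCreatedEvent, and Profile aggregate here are hypothetical stand-ins; the original post contains the full example):

// Dispatching the command with metadata attached to the message
public void createProfile(CommandGateway commandGateway) {
    var command = GenericCommandMessage.asCommandMessage(new CreateProfileCommand("my-profile-id"))
            .withMetaData(MetaData.with("username", "hansklok"));
    commandGateway.sendAndWait(command);
}

// Inside the aggregate: applying the event with metadata as well
@CommandHandler
public Profile(CreateProfileCommand cmd) {
    AggregateLifecycle.apply(
            new ProfileCreatedEvent(cmd.getProfileId()),
            MetaData.with("username", "hansklok")
    );
}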

· 7 min read

This blog was originally posted on blog.the-experts.nl.

AxonIQ has a module for Axon Framework that can encrypt data in events, so you don't have to worry about it. Still, I think this blog has educational value, so I'm keeping it here. The solution posed in this blog is still running in production at the Port of Rotterdam.

Privacy regulations are a pain

By now we all know about GDPR, right? It’s the privacy regulation of the EU that gives customers certain rights regarding their personal data. For instance, they have the right to retrieve all data related to them, or to have certain or all data deleted.

This presents us with a dilemma.