Distributed transactions in microservices: failure design

A hard challenge is a distributed transaction when it comes to microservices.
Understand the implicit network cost, failures and solutions when writing microservices based on an Ordering and Payment system.

Someone who professes their love for microservices probably hasn’t done it correctly. One who finds it mildly inconvenient is probably doing something not completely wrong. One who hates it intensely is probably a DevOps Engineer.

When writing microservices, it is easy to forget its inherent distributed and fragile nature, especially if an entire team is working on just a single service. We need to be aware of the implicit network cost and failure, and embrace it fully.

An ideal microservice does only one thing. It has no dependency on other services, like a hunter in a dark forest, it makes no assumption about its surrounding. It only cares about its own resiliency and survival.

On the other hand, a microservice that is aware of its surroundings is fragile. The more your application code has to deal with, the more assumption you have to make. In this post, we will look at how we can abstract distributed transactions away from the application code, and let the surrounding infrastructure deal with it.

A distributed transaction is one that spans multiple databases across the network while preserving ACID properties. If a transaction requires service A and B both write to their own database, and rollback if either A or B fails, then it is a distributed transaction.

Its prevalence in microservices is due to the distributed nature of the architecture, where transactions are usually implicitly distributed. However, it is not unique to microservices.

Hard Problem

To see why a distributed transaction is hard, let’s take a look at a textbook yet extremely common real-life example — an Ordering and Payment system.

Say we have an inventory management system, an ordering system and a payment system, each modelled as a microservice. The boundary is clear; the ordering system takes the order, the inventory system allocates the stock, while the payment system deals only with payment and refund related issues.

A single order transaction = creating an order + reserve stock + payment, in any order. Failure at any point during the transaction should revert everything before it. For example, a payment failure should cause the inventory system to release the reserved stocks, and the ordering system to cancel the order.

Forward creational state and Backward compensational state

How to Implement a Fragile Distributed Transaction

A naive implementation typically uses chained HTTP call for RPC between services. That’s fine if you’re working on a quick demo.

When an order request comes in, the ordering system takes the order, it makes an HTTP call to the inventory system to reserve stock. If the stock reservation succeeds, it calls the payment system to try to make payment with the credit card provided by the user. Otherwise, the stock is not reserved.

Happy scenario of a naively implemented distributed transaction

Now, if the payment fails, we need to roll back the stock reservation and the order creation.

A cascading local transaction rollback that relies on stable network

Some serious flaws with this approach.

Fallacy of distributed system — Relies heavily on the stability of the network throughout the transaction
Transactions could end up in an indeterminate state
Fragile to topology changes — Each system has explicit knowledge of its dependency

HTTP call blocks indefinitely, imagine payment service calls some third party API like PayPal or Stripe, where the transaction is effectively out of your control. What happens if the API is down or throttled. Or a network disruption along the network path. Or one of the three services is down, due to any of the 1000 reasons from application bug to submarine cable disruption.

We could put in place a client timeout. But what would you set? 5s? 10s? 30s? Any number is arbitrary and makes an implicit assumption of the network. In fact, when a connection timed out, it doesn’t say anything about the state of the transaction at all. It merely concludes that the call has taken more than your specified timeout.

Transaction is in an indeterminate state

Concretely, if the inventory system managed to reserve some stocks, but the payment system timed out for whatever reason, we cannot say that the payment has failed. If we treat timeout as a failure, we would have rolled back the stock reservation and cancelled the order, but the payment actually did go through. Perhaps the external payment API is taking more time than usual or experiencing a disruption in the network, so we cut off the connection before the payment service has a chance to respond. Now the transaction is in “Paid” and “Stock Released” state simultaneously.

All this knowledge about the surrounding services forces a service to deal with the specifics instead of improving its general resiliency. This primordial form of distributed transaction relies heavily on the interaction with other services, and the network being reliable. It is highly fragile to the topography changes or the slightest network disturbance.

Some may suggest things like exponential back-off Retry, but it is a nice enhancement to an already well-architected system, you cannot apply it here like a band-aid.

Let’s see how we can rearchitect this into something more robust.

A Robust Strategy

Should have the following core properties:

No explicit inter-service communication
Does not make an assumption about the reliability of the network and the services
Global transaction as a series of local ACID transactions
Transaction is always in a defined state
Transaction state is not managed
Eventually consistent, but consistent nonetheless
Reactive

We can achieve this using the Saga Pattern. It models the global distributed transaction as a series of local ACID transaction, with compensation as a rollback mechanism. The global transaction move between different defined states depends on the result of the local transaction execution. There are generally two kinds of saga implementation.

Orchestration
Choreography

The difference is the method of state transition, we will talk about the first one in this post.

Orchestration-based Saga

This type of Saga is a natural evolution from the naive implementation because it can be incrementally adopted.

Orchestrator

Or a transaction manager, is a coarse-grained service that exists only to facilitate the Saga. It is responsible for coordinating the global transaction flow, that is, communicating with the appropriate services that involve in the transaction, and orchestrate the necessary compensation action. The orchestrator is aware of the global distributed transaction, but the individual services are only aware of their local transaction.

Before we get to the concept of compensation, let’s improve the architecture a little by removing the explicit dependency between services.

Ditch the HTTP Call

Recall the inventory service that gets stuck waiting for payment service because of the RPC in the middle of its local transaction? A service’s local ACID transaction should ideally consist of just two steps:

Local business logic
Notify broker of its work done

Instead of calling another service in the middle of the transaction, let the service do its job within its scope and publish the status through a message broker. That’s all. No long, synchronous, blocking call somewhere in the middle of the transaction. We will use Kafka as the broker in this example for a reason that will become apparent later.

For the astute reader, yes, notifying the message broker breaks the ACID properties of the local transaction, which kind of defeats the whole point. What if the notification fails? Data has already been written into the database!

Event-driven Messaging via Transactional Outbox

To ensure that the two steps are in a single ACID transaction, we can make use of the Transactional Outbox pattern.

When we write the result of the local transaction into the database, the work done message is included as part of the transaction as well, into a message outbox table. This way, both local transaction and notification are in a single transaction, yet demarcated clearly. The work done message is ready to be picked up by the message broker for publishing.

Now, HOW and WHEN do we pick up the message? There must be a way for the database to notify the message broker that something has changed.

Change Data Capture

Change Data Capture (CDC) pattern does exactly. Kafka Connect is an excellent tool to capture database changes and stream it into Kafka. It is a framework that provides different vendor implementation of the connector, depending on what your Source is, be it a Postgresql database, S3 bucket, or even RabbitMQ queue. Kafka Connect provides only a few built-in connectors, and if you are using Postresql, then you have to install its connector, Debezium, which provides a connector implementation for Postgresql.

Streaming of database changes to Kafka using Debezium

Compensation

Now that all the services and orchestrators are unaware of each other and we have the local transaction problem sorted out, how do we model the global distributed transaction? Once a service has done its work, it publishes a message to the broker (could be a success or failure message). If the Payment system publishes a failure message, then the orchestrator must be able to “rollback” actions done by the Ordering and Inventory system.

In this case, each service must implement its version of compensating function. Inventory system that provides a ReserveStock function must also provide a ReleaseStock compensating function. Payment System that provides a Pay function must also provide a Refund compensating function, etc.

The orchestrator then listens to the failure events and publishes a corresponding compensating event.

A scenario where Payment fails, but ordering and stock reservation succeeded

Orchestrator publishes respective compensation request

Handling Request Idempotency

It is important for all services that are involved in a distributed transaction to handle duplicated requests correctly. Otherwise, good luck. This is usually achieved through the use of a unique Idempotency Key throughout the whole distributed transaction, and storing the key along with the API response. Whenever a wild duplicated request appears, search the key-value store for the Idempotency Key and return the API response. Otherwise, treat it as a new request.

This opens up a whole pandora box of distributed caching that we will not discuss here.

Relation to Antifragility

Though not exactly antifragile because this architecture does not thrive in the presence of a black swan event, it closely adheres to the core principle of antifragility.

The naive implementation will work perfectly probably around 90% of the time, but also subject to 10% catastrophe with huge loss (well depends on your business). The saga based architecture may seem less optimised due to the eventually consistent model, but it is tolerant of the network fluctuation, and its downside is limited in the event of a catastrophe.

Conclusion

This is not a remedy to apply “traditional transaction” at the level of a distributed system. Rather, it models transactions as a state machine, with each service’s local transaction acting as a state transition function.

It guarantees that the transaction is always in one of the many defined states. In the event of network disruption, you can always fix the problem and resume the transaction from the last known state.