Increasing Replicated State Machine Throughput in Trading Systems

Essay · January 28, 2024 · Evergreen · Finance, State Machine Replication

Replicated State Machines play a vital role in the development of fault-tolerant services, serving to safeguard key application state within trading systems. To successfully engineer a state machine that guarantees fault tolerance across replicas, it’s important to adhere to several key rules:

  • It must be deterministic, with each replica producing the same output when given the same starting state and input.
  • Every replica must receive exactly the same messages in the exact same order, regardless of network failures and other system issues.
  • Every copy of the replicated state machine must start with the exact same state.

In practical systems, we wrap the replicated state machine within a consensus protocol such as Raft or Paxos. These consensus protocols manage the delivery of commands (the inputs) to the replicated state machines, providing the necessary guarantees on ordering and delivery despite machine and networking failures.

[Diagram: a replicated state machine hosted within a distributed consensus protocol]

In a well-designed Raft or Paxos implementation, the Replicated State Machine tends to be isolated from other core threads of the system. It receives a steady stream of commands from the replicated log. This isolation and the requirement for determinism typically result in the Replicated State Machine being single-threaded. There are well-defined techniques to increase the throughput of consensus algorithms, but can we scale out the replicated state machine itself?
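
To make the single-threaded model concrete, below is a minimal, library-free sketch of such a state machine; the command shape and names are hypothetical and not tied to any particular consensus implementation.

```java
// Minimal sketch: a deterministic state machine applied on every replica.
// The consensus layer (Raft, Paxos, Aeron Cluster, etc.) is assumed to deliver
// the same commands, in the same order, to every replica.
final class ReplicatedStateMachine {
    // All state must be derived purely from the applied commands.
    private long balance = 0;
    private long lastAppliedIndex = -1;

    // Called by a single thread, in replicated-log order, on every replica.
    void apply(long logIndex, String command, long amount) {
        switch (command) {
            case "CREDIT": balance += amount; break;
            case "DEBIT":  balance -= amount; break;
            default: throw new IllegalArgumentException("Unknown command: " + command);
        }
        lastAppliedIndex = logIndex;
        // No wall-clock reads, random numbers, or iteration over unordered
        // collections here: any of those could cause replicas to diverge.
    }

    long balance()          { return balance; }
    long lastAppliedIndex() { return lastAppliedIndex; }
}
```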

Understanding Typical Trading System Usage of Replicated State Machines

First, it is useful to understand how we use replicated state machines in trading systems. The replicated state machines host the most critical application state, the state that must survive failures so that, should a machine fail, trading activity is minimally impacted. Common examples of this include the Matching Engine and a Credit Limit or Credit Matrix.

Typical Operations

Most operations we perform on the replicated state machines are write-oriented and require strong ordering guarantees. For example, we may:

  • Add an order to an order book within a matching engine, often using price-time priority or similar order-dependent logic
  • Consume some amount of a finite credit limit with time-based ordering of credit consumption.

Read operations, such as market data snapshots and subscriptions to an order book, are usually handled via gateways to protect the core replicated state machine.
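
To illustrate why these writes are order-dependent, here is a hypothetical sketch, far simpler than any production matching engine or credit manager, of price-time priority insertion and credit consumption applied in log order.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch of the two write paths described above; real matching
// engines and credit checks are considerably more involved.
final class TradingCommands {
    record Order(long clientId, String instrument, long price, long logTime, long quantity) {}

    // Price-time priority: best (highest) bid first, then earliest log time.
    private final PriorityQueue<Order> bids = new PriorityQueue<>(
        Comparator.<Order>comparingLong(Order::price).reversed()
                  .thenComparingLong(Order::logTime));

    private final Map<Long, Long> creditRemaining = new HashMap<>();

    void setCreditLimit(long clientId, long limit) {
        creditRemaining.put(clientId, limit);
    }

    // Must be called in replicated-log order on every replica: both the credit
    // decrement and the book insertion depend on everything that came before.
    boolean addOrder(Order order) {
        long required = order.price() * order.quantity(); // simplistic notional
        long remaining = creditRemaining.getOrDefault(order.clientId(), 0L);
        if (remaining < required) {
            return false; // rejected: would exceed the pre-agreed credit limit
        }
        creditRemaining.put(order.clientId(), remaining - required);
        bids.add(order); // earlier log entries win ties at the same price
        return true;
    }
}
```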

Orthogonal Data

Matching Engine and Credit data are typically held within a replicated state machine to provide fault tolerance. However, they are aggregated along different dimensions.

The diagram below shows how these orthogonal data sets make it hard to shard the data in the context of an over-simplified equities system:

[Diagram: replicated state machine orthogonal data]

When the replicated log’s order data is viewed from the credit perspective, aggregation is done per client, with all of a client’s orders grouped together. This aggregation allows the system to compute the exact credit utilization of that client’s orders, providing strong guarantees that any order exceeding a pre-agreed credit limit can be rejected.

When viewed from the order book perspective, the order’s instrument will determine which order book each order lands in. The insertion order (or, more correctly, the time) and price will typically impact the likelihood of allocation should there be a match within the order book.

As a result, we can’t just shard the replicated state machine by client or instrument unless there is a fully independent system component that performs the credit operations and another component that performs matching operations.
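
To make the sharding conflict concrete, consider the hypothetical sketch below: the same ordered stream of orders has to be folded into per-client credit state and per-instrument order books, so neither the client nor the instrument alone cleanly partitions the state.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the orthogonal aggregations: one ordered
// command stream is folded into per-client credit state and per-instrument books.
final class OrthogonalViews {
    record Order(long clientId, String instrument, long notional) {}

    // Credit view: keyed by client, spanning every instrument the client trades.
    private final Map<Long, Long> creditUsedByClient = new HashMap<>();

    // Matching view: keyed by instrument, spanning every client active in it.
    private final Map<String, Deque<Order>> bookByInstrument = new HashMap<>();

    void apply(List<Order> ordersInLogOrder) {
        for (Order order : ordersInLogOrder) {
            creditUsedByClient.merge(order.clientId(), order.notional(), Long::sum);
            bookByInstrument
                .computeIfAbsent(order.instrument(), k -> new ArrayDeque<>())
                .addLast(order);
        }
        // Sharding by client would split each instrument's book across shards;
        // sharding by instrument would split each client's credit across shards.
    }
}
```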

Research into scaling out Replicated State Machines

From what I’ve seen, few papers discuss improving the throughput of replicated state machines. Many papers describe how to improve the throughput of consensus algorithms delivering commands to replicated state machines (for example, by removing network hops with optimized consensus algorithms, by introducing pipelining into consensus, or by adopting partitioned application state), but these all still rely on single-threaded replicated state machines.

I’ll discuss two interesting papers I found that address improving the throughput of the state machine itself. Both ultimately turn on the same underlying question: does every command need to be totally ordered? In the case of trading systems, the answer is sadly yes, so the work offers little practical value for increasing the throughput of trading systems. Regardless, I found them interesting to read.

Rethinking State Machine Replication for Parallelism

This paper from 2013 introduces “P-SMR,” or Parallel State Machine Replication, which aims to retain the core external guarantees of determinism while removing any centralization that might reduce throughput. P-SMR is best suited to systems that primarily process order-insensitive independent read operations (the authors use the example of a nameserver as a good candidate).

At application development time, the application developers must label commands into categories:

  • Independent commands — these commands can be safely run entirely independently of any others, without any ordering constraints; an example might be a read or an idempotent, order-insensitive write.
  • Dependent commands — these commands are dependent on the order of all operations and require careful processing coordination. An example might be an operation to create a resource – this cannot run after a read or update operation on that same resource.
  • Partially dependent commands, where they are dependent on the same command parameter(s) — these commands are dependent and must be correctly ordered only when the parameters match, for example, two updates to the same resource.

This application-level knowledge of the command signatures and command dependencies is encoded into a C-Dep structure. C-Dep is used to drive the command execution plans within the server. This broadly works as follows:

  • Both client and server proxies are generated for managing interactions between the client and server processes. These proxies deal with the complexities of the ordering and parallel execution of tasks, leaving the core replicated state machine code unchanged from a typical single-threaded replicated state machine.
  • At the client level, C-Dep is used to deterministically map the command to multicast groups running replicated state machines. I find the wording a little unclear here, but I understand that this paper’s multicast group refers to a specific thread within the running server replicas. Commands that are independent and can run in parallel are multicast to a single group, while commands that introduce dependencies are multicast to multiple groups.
  • When a command addressed to a single multicast group arrives at the server, the server proxy executes it in parallel with commands in other groups. But when the command is dependent or partially dependent, the server proxy must coordinate execution across threads to strictly maintain ordering guarantees. P-SMR manages this with cross-thread barriers and signaling. During these times, the state machine is effectively single-threaded. A rough sketch of this classification follows below.
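
As a hedged sketch of how I read the C-Dep idea (the paper's actual dependency encoding and generated proxies are richer than this), the client proxy classifies each command and derives the multicast group or groups it must be sent to; the command names and group count below are illustrative only.

```java
import java.util.List;

// Hypothetical sketch of C-Dep-style routing; in my reading, each "group"
// corresponds roughly to a thread within every server replica.
final class CommandRouter {
    private static final int GROUPS = 4;

    // Independent and partially dependent commands go to a single group chosen
    // from their parameters; fully dependent commands must reach every group so
    // that execution can be coordinated (and ordering preserved) across threads.
    List<Integer> groupsFor(String command, String resourceId) {
        return switch (command) {
            case "READ"   -> List.of(groupOf(resourceId)); // independent
            case "UPDATE" -> List.of(groupOf(resourceId)); // partially dependent: same key, same group
            case "CREATE",
                 "DELETE" -> allGroups();                  // dependent: ordered against everything
            default -> throw new IllegalArgumentException("Unknown command: " + command);
        };
    }

    private int groupOf(String resourceId) {
        return Math.floorMod(resourceId.hashCode(), GROUPS);
    }

    private List<Integer> allGroups() {
        return List.of(0, 1, 2, 3); // one entry per group; GROUPS == 4
    }
}
```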

P-SMR’s focus on implementations that primarily operate on independent read operations goes against the needs of trading systems, where almost every command that reaches the replicated state machine is a dependent write operation. The paper can be found over at IEEE Xplore ($).

Performance optimization for state machine replication based on application semantics: A review

Early in the article, there is a statement that matches the immediate thought of almost everyone I speak to about the usage of distributed consensus and coordination in trading systems:

“However, state machine replication requires replicas to run deterministically and to process requests sequentially according to a total order. We argue that these constraints could impede the adoption of fault tolerance techniques in practice, for example… Executing requests at the replicated server sequentially according to a total order often results in unacceptable low system throughput and high end-to-end latency.”

The article starts with an argument that application semantics matter, and must be considered when building fault-tolerant systems. For example:

  • Does every command need to be totally ordered? If not, then relaxing that requirement can improve throughput. This is the same implicit argument made by the P-SMR paper discussed earlier.
  • The causal order of complex applications that involve interacting, dependent operations performed outside the safety of the totally ordered replicated state machine must be understood and properly modeled to ensure fault tolerance is correct. This scenario is not typical in trading systems, thankfully.
  • Application sources of non-determinism, such as uncoordinated random number generators or other imperfectly coordinated inputs such as time, must be removed when executing on independent replicas. In the practical trading systems I’ve worked with, these sources of non-determinism are removed and controlled via coding standards and static analysis tooling (see the sketch after this list).
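
As an illustration of the last point, here is a minimal sketch of the usual remedy, with hypothetical names: time is taken from the replicated command (stamped by the leader or consensus layer) rather than read from the local clock inside the state machine.

```java
// Hypothetical sketch: the wall clock, a source of non-determinism, is moved
// out of the state machine and into the replicated command, so every replica
// evaluates exactly the same value.
final class ExpiryStateMachine {
    private long expiredCount = 0;

    // BAD: each replica would read its own clock and could diverge.
    // boolean isExpired(long expiryTimeMs) { return System.currentTimeMillis() > expiryTimeMs; }

    // GOOD: the timestamp is agreed via the replicated log and identical on all replicas.
    void onCommand(long logTimestampMs, long orderExpiryTimeMs) {
        if (logTimestampMs > orderExpiryTimeMs) {
            expiredCount++; // deterministic: depends only on replicated inputs
        }
    }

    long expiredCount() {
        return expiredCount;
    }
}
```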

The article then introduces a performance engineering classification framework, showing that the degree of correctness required of a fault-tolerant replicated state machine can vary according to application needs, and that relaxing the correctness and liveness guarantees allows various additional throughput-increasing approaches to be introduced.

The performance engineering classification framework groups fault tolerance approaches into two broad categories:

  • Conservative fault-tolerance
  • Optimistic fault-tolerance

As defined by the authors, conservative fault tolerance provides stronger guarantees around correctness and liveness, but worse performance, when compared to optimistic fault tolerance. The correctness and liveness guarantees are critical and non-optional in trading systems, so the rest of this discussion of the article will focus on the conservative fault tolerance aspects.

The conservative fault-tolerance classification is further broken down into additional layers of performance engineering based on:

  • The types of operations on state objects. These operations may be basic (e.g., deterministic, straightforward read/write operations) or complex (e.g., operations whose result depends on the current state, such as incrementing a number or appending to a string, or non-deterministic operations such as using pseudo-random numbers or system time)
  • The relationship between requests, such as sessionless or session-based operations. This category does not apply to trading systems in finance; it applies to multi-tiered systems that allow operation reordering at a session level while maintaining a causal order at a client level. In a trading system, operation reordering cannot be allowed because it would change the financial outcome. For example, in a matching engine, the venue can know of no causal order between different market participants, so their orders could legitimately be reordered under session-based semantics; doing so would produce an invalid matching outcome and open the door to legal action against the trading venue.

When running basic operations on fully independent objects, the commands can safely be executed concurrently. When running basic operations on dependent objects, or complex operations, the system must behave as if it were single-threaded to remain deterministic. This is a generalization of the points made earlier about the P-SMR approach. Sadly, most operations in a typical trading system are either basic operations on dependent objects (e.g., an order book) or complex operations. The article can be found at the Journal of Systems and Software ($).
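
To make the independent-objects case concrete, here is a small hypothetical sketch: blind writes to unrelated keys can be applied concurrently, while anything order-sensitive (an order book insert, a credit decrement) must fall back to sequential application.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of "basic operations on independent objects": the final
// state does not depend on the relative order of the writes, so they may be
// applied concurrently without breaking determinism.
final class IndependentWrites {
    record Put(String key, long value) {}

    private final Map<String, Long> store = new ConcurrentHashMap<>();

    void applyBatch(List<Put> puts) {
        // Safe only because each Put blindly overwrites a single key and no key
        // appears twice in the batch; dependent or complex operations would have
        // to be applied sequentially, in log order, as in the earlier sketches.
        puts.parallelStream().forEach(put -> store.put(put.key(), put.value()));
    }
}
```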

In Practice

Aeron Cluster brings very high throughput at the consensus layer. When used with Aeron Premium components, Aeron Cluster can deliver millions of reasonably sized messages per second to the replicated state machine holding the application logic. Ultimately, though, it is the application logic that limits throughput to below what the consensus layer can deliver. Some of the fastest implementations of production trading systems running on Aeron Cluster that I know of can process several hundred thousand operations per second.

There are several approaches one can take to improve the throughput of application logic hosted within Aeron Cluster, when running on the JVM:

  • Adopt multiple clustered service containers within a single Aeron Cluster. A clustered service container is a host for a replicated state machine. In this scenario, Aeron Cluster runs each clustered service container in an independent thread, with each clustered service container receiving all messages;
  • Adopt ZeroGC/Zero Copy approaches within the application logic (a brief sketch follows this list) and/or advanced garbage collectors such as ZGC (generational ZGC in Java 21+) or the C4 collector in Azul Platform Prime;
  • Increase throughput, at the cost of increased latency, by adopting multiple clusters.
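
On the second bullet, here is a brief, library-free sketch of what a ZeroGC decode path looks like; production systems typically do this with Agrona buffers and SBE-generated flyweights, but the idea is simply to reuse one mutable object per message so the hot path allocates nothing.

```java
// Library-free sketch of a zero-allocation decode path: one mutable flyweight
// is reused for every inbound message, so the steady-state hot path creates no
// garbage for the collector to deal with.
final class OrderFlyweight {
    // Reused, mutable fields instead of a new Order object per message.
    long clientId;
    long price;
    long quantity;

    // Decode in place from a reusable byte[] (a stand-in for a direct buffer).
    void wrap(byte[] buffer, int offset) {
        clientId = readLong(buffer, offset);
        price    = readLong(buffer, offset + 8);
        quantity = readLong(buffer, offset + 16);
    }

    private static long readLong(byte[] buffer, int offset) {
        long value = 0;
        for (int i = 0; i < 8; i++) {
            value = (value << 8) | (buffer[offset + i] & 0xFFL);
        }
        return value;
    }
}
```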

The first option deserves additional discussion as it comes with a substantial increase in operational complexity. To illustrate this, let’s take an example of a single cluster holding a Matching Engine in one clustered service container and a Credit Utilization Manager in another. All messages between the Matching Engine and Credit Utilization Manager are performed via the distributed consensus log to retain determinism and fault tolerance.
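
To illustrate the shape of this option, below is a library-free sketch with hypothetical types, deliberately not the Aeron API: two state machines on independent threads, each consuming every entry of the same ordered log. In Aeron Cluster, as I understand the configuration, the equivalent is a consensus module configured for two services, with each clustered service container given its own service ID.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Library-free sketch of the multiple-container idea: two replicated state
// machines, each on its own thread, each consuming EVERY log entry in order.
// Aeron Cluster handles log fan-out, threading, snapshots, and replay in real
// deployments; this only illustrates the shape.
final class MultiServiceSketch {
    interface Service { void onLogEntry(long logIndex, String message); }

    static final class MatchingEngine implements Service {
        public void onLogEntry(long logIndex, String message) { /* match orders */ }
    }

    static final class CreditUtilizationManager implements Service {
        public void onLogEntry(long logIndex, String message) { /* adjust credit */ }
    }

    public static void main(String[] args) {
        List<Service> services = List.of(new MatchingEngine(), new CreditUtilizationManager());
        List<BlockingQueue<String>> logFanOut = List.of(
            new ArrayBlockingQueue<>(1024), new ArrayBlockingQueue<>(1024));

        // One thread per service; both threads see the same, fully ordered log.
        for (int i = 0; i < services.size(); i++) {
            final Service service = services.get(i);
            final BlockingQueue<String> queue = logFanOut.get(i);
            new Thread(() -> {
                long logIndex = 0;
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        service.onLogEntry(logIndex++, queue.take());
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }).start();
        }

        // The consensus layer appends each committed entry to every service's queue.
        for (BlockingQueue<String> queue : logFanOut) {
            queue.offer("NewOrder{client=1, instrument=VOD.L, qty=100}");
        }
    }
}
```

Note that in this sketch the two services never talk to each other; in the real system, any message from the Matching Engine to the Credit Utilization Manager has to travel back through the consensus log, which is where the operational complexity described next comes from.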

The first complexity arises when one clustered service container fails while the other does not, leaving one clustered service container non-operational and the other operational. Operationally, we must define how to detect and fix this situation correctly. The detection can be a complex problem – without going onto the machine and introspecting the process, it is tough to tell the difference between an unexpectedly slow operation and the process being stuck in an endless loop. Even with access to the logs, we cannot always determine the difference between a thread being unexpectedly terminated and the thread being stuck in an endless loop.

The second complexity happens when you need to unwind some log entries due to a log entry containing a poison pill message. With a single clustered service container, this is simple – just delete the entries from the poison pill onwards (any subsequent messages would not have been processed), and deal with the rest of the problem at the business layer. With multiple clustered service containers, this may be more complex, especially if only one of the clustered service containers was impacted by the poison pill. If there is any causal messaging across clustered service containers or any external gateways contacted by the remaining operational clustered service container, then the unwinding of operations is typically much more complex and can require significant downtime to address.

For these reasons, I have yet to see a production trading system adopt multiple clustered service containers. All the throughput gains are achieved via building highly performant Java code and/or adopting specific JVM implementations and/or running multiple clusters.

Some Aeron Cluster users take a related approach that copies the replicated state machine code into multiple host processes as well as into the Aeron Cluster clustered service container. However, this tends to reduce bandwidth more than it increases throughput.

Yet another approach to improving throughput is to move the replicated state machines into components independent of the total-ordering logic. In this case, either very limited or no application logic is held within the process that performs total ordering. The Island or Sequencer Architecture is an example of this. However, not all implementations use replicated state machines for fault tolerance – some use simple, non-deterministic primary-backup replication. Some implementations do, however, replicate the application state by hosting multiple independent replicated state machines in different processes, reducing the need for inter-application communication and thus improving throughput. This approach will be the topic of a future post.


The colors used in the diagrams in this post are sourced from a viewpoint in Canyonlands National Park, Utah, USA.