
Read Heavy Aeron Cluster Workloads

February 4, 2024 - Finance, Aeron Cluster - evergreen

Most trading systems implemented on Aeron Cluster that I have worked with were write-focused. Read operations were typically broadcast to gateways, and the gateways would handle the distribution along with any specific reads. The read-heavy systems were the interesting ones, because the mechanisms that can be deployed for read operations vary greatly depending on the level of read consistency the application requires.

Read Consistency Models

This section provides an overview of three read consistency models tailored primarily to the needs of application use cases often encountered in trading systems. It does not exhaustively describe all possible models; for that, I’d recommend looking at Jepsen’s Consistency Models section or the source material behind it, particularly Consistency in Non-Transactional Distributed Storage Systems by Viotti and Vukolić. Raft consensus serves as a tool for constructing systems resilient to failures, and it is possible to overlay application protocols on Raft that cater to various read consistency models.

First up is linearizable consistency. This strong consistency model maps well to Raft, which provides the necessary synchronization, and it is a very intuitive consistency model for developers to work with. Every read returns the value of the most recent write completed before the read was initiated, and all operations respect a single global ordering.

Next, we have monotonic reads. Monotonic reads guarantee that if a read operation Y follows another read operation X from the same process or transaction, then Y will see the same or a more recent version of the data than X. This consistency model prevents a process from reading data out of order, ensuring that reads are monotonic in time. However, there is no guarantee of data recency.

Finally, eventual consistency. The only guarantee is that if no more updates are made to some data, all processes will eventually converge on the same value. There is no upper bound on the time this could take, nor any guarantee about the order in which updates are seen. In the practical Raft-based systems I have worked with, this is typically expressed as stale reads. This model can be challenging to work with in financial systems.

Read Consistency Models and Raft

Linearizable Reads

We have several options to perform linearizable reads with Raft:

We can submit reads via the log. In this case, the result is totally ordered with respect to the other commands. However, reads performed this way are subject to the overhead of replicating the log entries.
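As a rough sketch of this first option (not a full program; READ_REQUEST and its encoding are hypothetical application-level details), a read submitted via the log simply travels through the Aeron Cluster client like any other command and is answered by the clustered service in log order:

    import io.aeron.cluster.client.AeronCluster;
    import org.agrona.ExpandableArrayBuffer;
    import org.agrona.MutableDirectBuffer;

    final class LogReadClient
    {
        // Hypothetical application-level message type; not part of Aeron Cluster itself.
        private static final int READ_REQUEST = 1;

        private final MutableDirectBuffer buffer = new ExpandableArrayBuffer();

        // The read is appended to the Raft log like any other command, so the service
        // applies it in log order and the response (delivered on the egress channel)
        // reflects every write that precedes it.
        void requestRead(final AeronCluster cluster, final long instrumentId)
        {
            buffer.putInt(0, READ_REQUEST);
            buffer.putLong(Integer.BYTES, instrumentId);

            while (cluster.offer(buffer, 0, Integer.BYTES + Long.BYTES) < 0)
            {
                cluster.pollEgress(); // keep the session serviced while back-pressured
            }
        }
    }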

We can read directly from the leader process via a side channel that avoids appending an entry to the log. In this case, reads are linearizable if the process is still the leader. To know it still holds leadership, the leader must validate its term with at least a quorum of nodes before responding. Once again, we face a coordination overhead.
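For the second option, here is a minimal protocol sketch, with all types hypothetical rather than Aeron APIs: before serving a read from its local state, the leader confirms its term with a quorum of the cluster.

    import java.util.List;
    import java.util.function.Supplier;

    // All types below are hypothetical illustrations of the protocol, not Aeron Cluster APIs.
    interface Follower
    {
        // True if the follower still recognises this leadership term.
        boolean confirmTerm(long leadershipTermId);
    }

    final class LeaderReadPath
    {
        private final long leadershipTermId;
        private final List<Follower> followers;

        LeaderReadPath(final long leadershipTermId, final List<Follower> followers)
        {
            this.leadershipTermId = leadershipTermId;
            this.followers = followers;
        }

        // Serve the read from leader-local state only after a majority of the cluster
        // (the leader plus acknowledging followers) has confirmed the current term.
        <T> T linearizableRead(final Supplier<T> readFromLocalState)
        {
            final int clusterSize = followers.size() + 1;
            int acks = 1; // the leader counts itself

            for (final Follower follower : followers)
            {
                if (follower.confirmTerm(leadershipTermId))
                {
                    acks++;
                }
            }

            if (acks <= clusterSize / 2)
            {
                throw new IllegalStateException("leadership could not be confirmed");
            }

            return readFromLocalState.get();
        }
    }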

Monotonic Reads

Monotonic reads introduce added complexity but do allow reads from follower nodes. With the first approach described below, reads can be made from any follower while maintaining the guarantee. A modification that supports monotonic reads from only a single follower node follows after that.

This model requires a global monotonic clock alongside existing data about the current leadership term. I’ll call this global clock the “current index.” Every command coming into the Raft cluster’s state machine should increment the “current index,” and every resulting event should emit the most recent value of the clock and leadership term.
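A minimal sketch of that clock, with hypothetical types (this is application code layered on top of the cluster, not an Aeron Cluster feature):

    // Hypothetical sketch of the “current index” logical clock inside the state machine.
    final class CurrentIndexClock
    {
        private long leadershipTermId;
        private long currentIndex;

        // Called once for every command applied by the replicated state machine.
        long onCommandApplied(final long termId)
        {
            leadershipTermId = termId;
            return ++currentIndex;
        }

        long leadershipTermId()
        {
            return leadershipTermId;
        }

        long currentIndex()
        {
            return currentIndex;
        }
    }

    // Every event the state machine emits carries both values, so clients and followers
    // can reason about how far the replicated state has advanced.
    record StampedEvent(long leadershipTermId, long currentIndex, byte[] payload) { }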

When a read is made from a follower, the follower opens a side-channel communication with the current leader to get the leader’s “current index” value. It must then await the processing of the Raft log entries locally to ensure that its internal copy of the “current index” has reached the expected leader value. After applying all the expected log entries, the follower can respond to the read. When leaders change, followers should use the same side channel to refresh their “current index” value. Either clients must maintain their own internal state for the “current index”, or it will need to be pushed to the cluster and then used for validation. ZooKeeper offers similar capabilities via its zxid.
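The follower side of that protocol might then look roughly like this (hypothetical types again, reusing the CurrentIndexClock sketch above):

    import java.util.function.Supplier;

    // Hypothetical protocol sketch; the side channel and read path are application code.
    final class FollowerReadPath
    {
        interface LeaderSideChannel
        {
            // e.g. request/response over a dedicated Aeron publication and subscription.
            long fetchLeaderCurrentIndex();
        }

        private final LeaderSideChannel sideChannel;
        private final CurrentIndexClock localClock; // from the sketch above

        FollowerReadPath(final LeaderSideChannel sideChannel, final CurrentIndexClock localClock)
        {
            this.sideChannel = sideChannel;
            this.localClock = localClock;
        }

        <T> T monotonicRead(final Supplier<T> readFromLocalState)
        {
            final long requiredIndex = sideChannel.fetchLeaderCurrentIndex();

            // Wait until local log application has caught up with what the leader reported.
            while (localClock.currentIndex() < requiredIndex)
            {
                Thread.onSpinWait(); // or park and re-poll, depending on the duty cycle
            }

            return readFromLocalState.get();
        }
    }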

The above approach allows a single process to make monotonic reads across any cluster node, at the cost of coordination with the leader. If we only need to support monotonic reads by the same process from the same node (i.e., not across all nodes), we can remove the need to fetch the “current index” from the leader. In this case, the client’s last seen “current index” must be tracked, and the protocol must ensure that a read is only served once the node has applied a “current index” equal to or greater than that value.
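A sketch of this single-node variant, again with hypothetical types: the node only compares the client’s last seen “current index” against its own applied index, with no call to the leader.

    import java.util.function.Supplier;

    // Hypothetical sketch: monotonic reads from a single node only, no leader coordination.
    final class SingleNodeMonotonicReads
    {
        private final CurrentIndexClock localClock; // from the earlier sketch

        SingleNodeMonotonicReads(final CurrentIndexClock localClock)
        {
            this.localClock = localClock;
        }

        // clientLastSeenIndex is tracked by the client (or per session on this node).
        <T> T read(final long clientLastSeenIndex, final Supplier<T> readFromLocalState)
        {
            if (localClock.currentIndex() < clientLastSeenIndex)
            {
                throw new IllegalStateException("node has not yet applied the client's last seen index");
            }

            return readFromLocalState.get();
        }
    }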

In both cases, log compaction must take the “current index” into account.

Stale Reads

If we can loosen our constraints, we can perform stale reads from both the leader and followers.

First, we might perform a stale read if we read directly from the data held within the leader, bypassing the log and not validating that it is still the leader. Such reads are linearizable during regular operation, but may become stale during leader elections or network partitions.

If we instead read directly from followers, without the “current index” coordination that gave us monotonic reads and without interacting with the leader, we will usually be performing stale reads.

Read Consistency Models & Aeron Cluster

Aeron Cluster supports linearizable reads, where reads are performed via the log. It also provides the extension points necessary to support all the additional read consistency models described above.

Side-channel communications can be wired up within the Clustered Service Agent’s duty cycle. You must add the necessary protocol and additional Aeron publications and subscriptions yourself; see doBackgroundWork within ClusteredService. Note that any work performed during the doBackgroundWork duty cycle can impact the latency of the replicated state machine. This approach retains the existing single-threaded business logic model.
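As a sketch of that wiring, assuming the int doBackgroundWork(long nowNs) hook available on recent Aeron versions, with the side-channel subscription and handler as hypothetical application code:

    import io.aeron.Subscription;
    import io.aeron.cluster.service.ClusteredService;
    import io.aeron.logbuffer.FragmentHandler;

    // Sketch only: the remaining ClusteredService callbacks (onStart, onSessionMessage,
    // onTakeSnapshot, ...) are elided. The side-channel subscription and handler are
    // hypothetical application wiring, not part of Aeron Cluster.
    final class SideChannelService implements ClusteredService
    {
        private Subscription sideChannelSubscription;

        private final FragmentHandler onSideChannelMessage = (buffer, offset, length, header) ->
        {
            // e.g. answer a follower's “current index” request
        };

        public int doBackgroundWork(final long nowNs)
        {
            // Runs on the clustered service agent's duty cycle; keep this cheap, as it
            // competes with log processing and so affects state machine latency.
            return sideChannelSubscription.poll(onSideChannelMessage, 10);
        }
    }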

Stale reads could be performed via an in-process Aeron Cluster Client attached directly to the cluster nodes. In this case, it would need to reach across threads, and thread-safe data access would need to be guaranteed using standard Java mechanisms. This approach would likely introduce locking and constrain the performance of the replicated state machine. Developers would need to ensure that any read operations performed are deterministic. If developers wanted reads that are usually linearizable, they would also need an application-level protocol that lets callers of the in-process Aeron Cluster client know whether they are speaking to the current cluster leader.
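As one example of a “standard Java mechanism” for that cross-thread access (a sketch, with hypothetical state): the service publishes an immutable snapshot that the in-process reader thread can consume without tearing, trading lock contention for the cost of building the copy.

    import java.util.Map;
    import java.util.concurrent.atomic.AtomicReference;

    // Hypothetical sketch of sharing state between the clustered service thread (writer)
    // and an in-process reader thread by publishing immutable snapshots.
    final class SharedStateView
    {
        private final AtomicReference<Map<Long, Long>> snapshot = new AtomicReference<>(Map.of());

        // Called from the clustered service thread after commands have been applied.
        void publish(final Map<Long, Long> latestPositions)
        {
            snapshot.set(Map.copyOf(latestPositions)); // immutable copy, safe to hand out
        }

        // Called from the reader thread; the result may be stale but is always complete.
        Map<Long, Long> read()
        {
            return snapshot.get();
        }
    }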

Another mechanism that can be used to support stale reads is to extract the cluster-hosted replicated state machine and run it against a copy of the Raft log from the leader within an external process. The external process can be constructed using code similar to Cluster Backup. Stale reads can be made from this process once it is successfully receiving and applying the cluster log.
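A sketch of the shape of such a process, with everything hypothetical and the log replication plumbing (the part Cluster Backup provides) elided:

    // Hypothetical sketch of an external read-only process. The plumbing that copies the
    // cluster log from the leader is elided; the same business state machine the cluster
    // runs is replayed locally and serves stale reads.
    final class ReadReplica
    {
        interface ReplicatedLogSource
        {
            byte[] pollEntry(); // next replicated log entry, or null if none available yet
        }

        interface BusinessStateMachine
        {
            void apply(byte[] logEntry);
            long queryPosition(long instrumentId); // example read, hypothetical
        }

        private final ReplicatedLogSource logSource;
        private final BusinessStateMachine stateMachine;

        ReadReplica(final ReplicatedLogSource logSource, final BusinessStateMachine stateMachine)
        {
            this.logSource = logSource;
            this.stateMachine = stateMachine;
        }

        // Duty cycle: keep applying replicated entries; reads served in between are stale.
        int doWork()
        {
            int workCount = 0;
            byte[] entry;
            while ((entry = logSource.pollEntry()) != null)
            {
                stateMachine.apply(entry);
                workCount++;
            }
            return workCount;
        }

        long staleRead(final long instrumentId)
        {
            return stateMachine.queryPosition(instrumentId);
        }
    }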

The “current index” capability can be easily added to the application layer and protocol.