Coordinating shared state across multiple state machines within a single application in a sequenced trading system presents unique challenges. This post explores these issues and proposes solutions using State Machine Replication techniques. (If you are not familiar with state machine replication, I would suggest that you first review State Machine Replication.)
To help illustrate the challenges, we will consider a trading environment that must be available 24/7 and is built atop a sequencer that provides global total ordering and replayability of messages. In this setting, an Order Management System (OMS) and a Smart Order Router (SOR) share reference and configuration data. Weeks after the OMS is launched, an internal Matching Engine is introduced, requiring access to some of the same reference data. Different teams develop each component (the SOR, the OMS, and the Matching Engine), and each process follows its own deployment schedule. For simplicity, we assume that the entire environment is built in Java.
This raises several questions:
- How should the reference data be shared?
- How should the OMS and Matching Engine reference data be kept in sync?
- How can the different teams work together to ensure consistency without duplicating effort?
I will introduce potential solutions to these problems based on independent state machines (one for the reference data and one for the rest of the application logic) composed together into a single application.
Leveraging State Machine Replication
Given a sequencer environment which provides global total ordering of messages and replay capabilities, we can leverage State Machine Replication (SMR) to share state between the different components without needing additional messaging between processes.
To illustrate the impact of the sequencer environment, let's first consider a simple example outside of a sequencer environment.
In this model, the Reference Data Administration Service accepts an update to the reference data from the admin team. Then, it sends a message to the OMS and SOR to update their respective reference data caches. Assuming that the OMS and SOR both process the message successfully, the reference data will eventually be updated on each of them. However, there is no guarantee that the SOR and OMS will receive and process the update at the same point relative to the other business messages (such as orders) they are handling. The processes may have different loads and different queued-up requests, so one of them may process a given order against the old reference data while the other already uses the new data. This could lead to a number of consistency issues and, in turn, to integration issues when the different systems need to work together.
In contrast, in a sequencer environment, we have global total ordering. We therefore have linearizable consistency. Every process will receive the update at the exact same moment in logical time. By this I mean that the sequencer will sequence the message and broadcast it to all the processes. All processes will receive the messages in the same order, and the reference data update will have the same sequence number for all processes. State machines are single-threaded and deterministic, and process messages in the order they are received. The exact wall-clock time at which each process handles the message is not guaranteed to be the same, nor is this required. Given this, we can leverage SMR to ensure that the reference data is updated and consistent across the SOR and OMS.
In this model, the Administration Client submits a reference data update to the sequencer. The sequencer sequences the message and sends it out to the other clients of the sequencer. Each client then applies the update to its internal copy of the reference data state machine. Because the sequencer guarantees total ordering and replayability of messages, the SOR and OMS will receive the update in the same order and apply it at the same point in the message sequence, even if not at the same wall-clock time. The reference data is now consistent across the SOR and OMS.
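The replication property described above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not a real sequencer client: `SequencedUpdate`, `ReferenceDataStateMachine`, the `tickSize` attribute and the in-memory list standing in for the sequenced stream are all hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A reference data update as delivered by the sequencer (hypothetical message shape). */
final class SequencedUpdate {
    final long sequence;        // position assigned by the sequencer
    final String instrumentId;
    final long tickSize;        // illustrative attribute only

    SequencedUpdate(long sequence, String instrumentId, long tickSize) {
        this.sequence = sequence;
        this.instrumentId = instrumentId;
        this.tickSize = tickSize;
    }
}

/** Deterministic state machine: its state depends only on the messages applied, in order. */
final class ReferenceDataStateMachine {
    private final Map<String, Long> tickSizeByInstrument = new HashMap<>();
    private long lastAppliedSequence = -1;

    void apply(SequencedUpdate update) {
        if (update.sequence <= lastAppliedSequence) {
            return; // assumption: messages already applied (e.g. seen again on replay) are skipped
        }
        tickSizeByInstrument.put(update.instrumentId, update.tickSize);
        lastAppliedSequence = update.sequence;
    }

    Long tickSizeOf(String instrumentId) {
        return tickSizeByInstrument.get(instrumentId);
    }

    long lastAppliedSequence() {
        return lastAppliedSequence;
    }
}

final class ReplicationDemo {
    public static void main(String[] args) {
        // Stand-in for the sequenced stream that every process receives in the same order.
        List<SequencedUpdate> stream = new ArrayList<>();
        stream.add(new SequencedUpdate(0, "ABC", 5));
        stream.add(new SequencedUpdate(1, "XYZ", 1));
        stream.add(new SequencedUpdate(2, "ABC", 10));

        // The OMS and the SOR each host their own copy of the state machine.
        ReferenceDataStateMachine omsCopy = new ReferenceDataStateMachine();
        ReferenceDataStateMachine sorCopy = new ReferenceDataStateMachine();
        for (SequencedUpdate update : stream) {
            omsCopy.apply(update);
        }
        for (SequencedUpdate update : stream) {
            sorCopy.apply(update); // may happen later in wall-clock time
        }

        // Same input order plus deterministic logic gives the same state.
        System.out.println(omsCopy.tickSizeOf("ABC").equals(sorCopy.tickSizeOf("ABC"))); // prints true
    }
}
```

The two copies never exchange messages with each other. They converge only because they apply the same sequenced input through the same deterministic logic.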
The OMS and SOR state machines can consume the reference data by one of several mechanisms:
- A data structure constructed out of the reference data state machine internal data can be shared with the OMS and SOR. Should they need any reference data, they can simply read it from the shared data structure. This data structure can be updated by listening to update events from the reference data state machine.
- A view can be constructed from the reference data state machine. The view can be shared with the OMS and SOR. They can then read the data directly from the view using the view's API (for example, a getInstrumentById method).
- The OMS and SOR can attach listeners to the reference data state machine. They can be notified when the reference data is updated and can react to the update, capturing the data state internally.
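To make the view and listener mechanisms more concrete, here is a rough sketch. Only the `getInstrumentById` method name comes from the text above. The `Instrument`, `ReferenceDataView`, `InstrumentListener` and `ReferenceData` types are hypothetical and exist only for illustration, and everything is assumed to run on the single state machine thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reference data entity. */
final class Instrument {
    final String id;
    final String symbol;

    Instrument(String id, String symbol) {
        this.id = id;
        this.symbol = symbol;
    }
}

/** Read-only view handed to the OMS and SOR state machines. */
interface ReferenceDataView {
    Instrument getInstrumentById(String id); // returns null if the instrument is unknown
}

/** Callback for consumers that prefer to capture the state themselves. */
interface InstrumentListener {
    void onInstrumentUpdated(Instrument instrument);
}

/** Reference data state machine exposing both a view and listener registration. */
final class ReferenceData implements ReferenceDataView {
    private final Map<String, Instrument> instrumentsById = new HashMap<>();
    private final List<InstrumentListener> listeners = new ArrayList<>();

    void addListener(InstrumentListener listener) {
        listeners.add(listener);
    }

    /** Called on the state machine thread for each sequenced update. */
    void onUpdate(Instrument instrument) {
        instrumentsById.put(instrument.id, instrument);
        for (InstrumentListener listener : listeners) {
            listener.onInstrumentUpdated(instrument);
        }
    }

    @Override
    public Instrument getInstrumentById(String id) {
        return instrumentsById.get(id);
    }
}

final class ViewAndListenerDemo {
    public static void main(String[] args) {
        ReferenceData referenceData = new ReferenceData();

        // Mechanism 1: the OMS reads through the view API.
        ReferenceDataView omsView = referenceData;

        // Mechanism 2: the SOR captures updates in its own structure via a listener.
        Map<String, Instrument> sorCache = new HashMap<>();
        referenceData.addListener(instrument -> sorCache.put(instrument.id, instrument));

        referenceData.onUpdate(new Instrument("1", "ABC"));

        System.out.println(omsView.getInstrumentById("1").symbol); // prints ABC
        System.out.println(sorCache.containsKey("1"));             // prints true
    }
}
```

Neither consumer sees the update protocol or the internal map. The OMS depends only on the view interface, and the SOR depends only on the listener callback.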
In all cases, the OMS and SOR state machines are somewhat decoupled from the reference data state machine. They do not need to know anything about the reference data state machine protocol, any internal structures or logic, and they do not need to coordinate with each other. Yet they have real-time access to the correct state.
As the reference data protocol or logic is updated, the OMS and SOR can simply adopt an updated jar, and adapt to the new APIs as required.
We now face a new challenge: how do we get the reference data to the Matching Engine? Do we need to seed the Matching Engine with the reference data? Or can it consume a snapshot of the reference data state machine?
To solve this, we introduce some changes:
- We add a new Reference Data Service that hosts a copy of the reference data state machine.
- We have the Reference Data state machine accept updates from the Administration Client, and apply them to its internal state.
- The Reference Data Service produces a snapshot of its state machine, which is shared with any process that needs to seed its reference data state machine.
- Once we install the Matching Engine, we seed its reference data state machine using the snapshot shared from the Reference Data Service.
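The seeding steps above could look roughly like the following. This is a sketch under assumptions: the snapshot is modelled as an immutable in-memory copy tagged with the last sequence number it covers, and `ReferenceDataSnapshot`, `SnapshottableReferenceData`, `takeSnapshot` and `fromSnapshot` are hypothetical names. How a snapshot is actually serialized and shared between processes is outside the scope of the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Immutable copy of the reference data, tagged with the last sequence it includes. */
final class ReferenceDataSnapshot {
    final long sequence;
    final Map<String, String> symbolById;

    ReferenceDataSnapshot(long sequence, Map<String, String> symbolById) {
        this.sequence = sequence;
        this.symbolById = Collections.unmodifiableMap(new HashMap<>(symbolById));
    }
}

final class SnapshottableReferenceData {
    private final Map<String, String> symbolById = new HashMap<>();
    private long lastAppliedSequence = -1;

    void apply(long sequence, String instrumentId, String symbol) {
        if (sequence <= lastAppliedSequence) {
            return; // already covered by the snapshot or applied earlier
        }
        symbolById.put(instrumentId, symbol);
        lastAppliedSequence = sequence;
    }

    ReferenceDataSnapshot takeSnapshot() {
        return new ReferenceDataSnapshot(lastAppliedSequence, symbolById);
    }

    /** Seeds a fresh state machine from a snapshot instead of replaying from the start. */
    static SnapshottableReferenceData fromSnapshot(ReferenceDataSnapshot snapshot) {
        SnapshottableReferenceData seeded = new SnapshottableReferenceData();
        seeded.symbolById.putAll(snapshot.symbolById);
        seeded.lastAppliedSequence = snapshot.sequence;
        return seeded;
    }

    String symbolOf(String instrumentId) {
        return symbolById.get(instrumentId);
    }

    long lastAppliedSequence() {
        return lastAppliedSequence;
    }
}

final class SeedingDemo {
    public static void main(String[] args) {
        // The Reference Data Service has been running for a while.
        SnapshottableReferenceData service = new SnapshottableReferenceData();
        service.apply(0, "1", "ABC");
        service.apply(1, "2", "XYZ");
        ReferenceDataSnapshot snapshot = service.takeSnapshot(); // covers sequences 0..1
        service.apply(2, "1", "ABCD");

        // The new Matching Engine is seeded, then consumes the stream after the snapshot.
        SnapshottableReferenceData matchingEngine = SnapshottableReferenceData.fromSnapshot(snapshot);
        matchingEngine.apply(2, "1", "ABCD");

        System.out.println(matchingEngine.symbolOf("1").equals(service.symbolOf("1"))); // prints true
    }
}
```

Tagging the snapshot with a sequence number is what lets the seeded process know where to resume in the sequenced stream.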
This does introduce some additional constraints. Most notably, the snapshots MUST be taken at a consistent point in logical time (that is, at the same position in the sequenced message stream) across all state machines (Reference Data, Matching Engine, OMS and SOR) running on the sequencer. This ensures that the snapshots are mutually consistent when we have cross-state machine dependencies. An additional constraint is that all the state machines composed together within the OMS, SOR and Matching Engine MUST run on the same single thread.
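One way to picture the single-thread constraint is a composite that hands every sequenced message to each composed state machine in turn, on the calling thread. The sketch below is only an illustration. The `StateMachine` interface, the `CompositeStateMachine` class and the string-encoded `REF:` and `ORDER:` messages are assumptions made for the example, not part of any particular sequencer API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

interface StateMachine {
    void onMessage(long sequence, String message);
}

/** Dispatches each message to every composed state machine, in order, on one thread. */
final class CompositeStateMachine implements StateMachine {
    private final List<StateMachine> delegates = new ArrayList<>();
    private long lastSequence = -1;

    CompositeStateMachine add(StateMachine delegate) {
        delegates.add(delegate);
        return this;
    }

    @Override
    public void onMessage(long sequence, String message) {
        for (StateMachine delegate : delegates) {
            delegate.onMessage(sequence, message);
        }
        lastSequence = sequence; // every delegate is now at the same logical time
    }

    long lastSequence() {
        return lastSequence;
    }
}

/** Reference data state machine: tracks which instruments exist. */
final class KnownInstruments implements StateMachine {
    private final Set<String> known = new HashSet<>();

    @Override
    public void onMessage(long sequence, String message) {
        if (message.startsWith("REF:")) {
            known.add(message.substring(4));
        }
    }

    boolean isKnown(String symbol) {
        return known.contains(symbol);
    }
}

/** Business logic state machine: accepts orders only for known instruments. */
final class OrderLogic implements StateMachine {
    private final KnownInstruments referenceData;
    private final List<String> accepted = new ArrayList<>();

    OrderLogic(KnownInstruments referenceData) {
        this.referenceData = referenceData;
    }

    @Override
    public void onMessage(long sequence, String message) {
        if (message.startsWith("ORDER:") && referenceData.isKnown(message.substring(6))) {
            accepted.add(message.substring(6) + "@" + sequence);
        }
    }

    List<String> accepted() {
        return accepted;
    }
}

final class CompositionDemo {
    public static void main(String[] args) {
        KnownInstruments referenceData = new KnownInstruments();
        OrderLogic orders = new OrderLogic(referenceData);
        CompositeStateMachine application = new CompositeStateMachine().add(referenceData).add(orders);

        application.onMessage(0, "ORDER:ABC"); // rejected: ABC is not yet known
        application.onMessage(1, "REF:ABC");
        application.onMessage(2, "ORDER:ABC"); // accepted

        System.out.println(orders.accepted()); // prints [ABC@2]
    }
}
```

Because the order logic reads the reference data directly, running both on one thread means it can never observe the reference data at a different sequence number than its own. For the same reason, a snapshot taken between two calls to `onMessage` captures all composed state machines at the same sequence number.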