Working draft - not yet complete
Sharing state correctly in a sequencer-based trading environment is a challenging problem. This post discusses some of the challenges and options on how to address them.
To help illustrate the challenges, we will consider a trading environment that must be available 24/7 and is equipped with a sequencer that provides global total ordering and replayability of messages. In this setting, a Retail Order Management System (OMS) and a Smart Order Router (SOR) share reference and configuration data. Weeks after the Retail OMS is launched, an Institutional OMS is introduced, requiring access to some of the same reference data. The Retail OMS generates market data that is shared externally with a large number of clients via a dedicated FIX market data feed. Different teams develop each component—the SOR, Retail OMS, and Institutional OMS—and each process follows its own deployment schedule. For simplicity, we will assume that the entire environment is built using Java.
This raises a number of questions:
- How should the reference data be shared?
- How should the Retail OMS and Institutional OMS reference data be kept in sync?
- How can the different teams work together to ensure consistency without duplicating effort?
- How can we protect the sequencer from excessive load?
Leveraging State Machine Replication
Given a sequencer environment which provides global total ordering of messages and replay capabilities, we can leverage State Machine Replication (SMR) 00If you are not familiar with state machine replication I would suggest that you first review State Machine Replication. to share state between the different components without needing additional messaging between processes.
To illustrate the impact of the sequencer environment, let's consider a simple example that is outside of a sequencer environment first.
In this model, the Reference Data Administration Service accepts an update to the reference data from the admin team. Then, it sends a message to the Retail OMS and SOR to update their respective reference data caches. Assuming that the Retail OMS and SOR successfully process the message, the reference data is updated and consistent across the SOR and Retail OMS. However, there is no guarantee that the update will be received and processed by the SOR and Retail OMS in the same order in which they were sent. The processes may have different loads and queued up requests, therefore, the reference data is not guaranteed to be consistent across the SOR and Retail OMS as they are processing other business messages (such as orders). This could lead to a number of consistency issues, in turn leading to integration issues when the different systems need to work together.
As a contrast to the above, in a sequencer environment, we have global total ordering. We therefore have linearizable consistency. Every process will receive the update at the exact same moment in logical time. 00By this I mean that the sequencer will sequence the message and broadcast it to all the processes. All processes will receive the messages in the same order, and the reference data update would have the same sequence number for all processes. State machines are single threaded and deterministic, and process messages in the order they are received. The exact wall time at which the processes will process the message is not guaranteed to be the same, nor is this required. Given this, we can leverage SMR to ensure that the reference data is updated and consistent across the SOR and Retail OMS.
In this model, the Administration Client submits a reference data update to the sequencer. The sequencer sequences the message and sends it out to the other clients of the sequencer. They then apply the update to their internal copy of the reference data state machine. Because the sequencer guarantees total ordering and replayability of messages, the SOR and Retail OMS will receive the update in the same order and apply it at the exact same moment. The reference data is now consistent across the SOR and Retail OMS.
But not all is perfect. Now we have a Reference Data State Machine that must be shared across multiple processes. These processes are built by different teams, and reach the production environment at different times. This now introduces new challenges:
- How can we share the state machine between the different processes? Do we share the source code or a jar file? Or do we publish the protocol for the state machine and allow each team to implement their own version? If we publish the protocol, how do we ensure that the different implementations are compatible?
- What happens when we need to update the structure of the reference data? Do we update all the teams in lock step? Or do we allow teams to update their version at their own pace?
Once the Institutional OMS is launched, it will also need access to the reference data. This adds more challenges: how do we get the reference data to the Institutional OMS? Given that the reference data is now shared, can we somehow extract the reference data from another instance of the Reference Data state machine and share it with the Institutional OMS?
this post is not yet complete