The Chain Replication family offers an interesting approach to building systems with high throughput and high availability while still offering strong consistency guarantees. The protocols are especially well suited to read-heavy workloads. Three protocols are discussed below:
- Chain Replication (2004)
- Chain Replication with Apportioned Queries (2009)
- Hermes (2020)
Chain Replication
Chain replication has a very different structure to most high-availability services, for example:
- Chain replication has no capability to elect a leader or deal with split brain internally; it must rely on an external configuration manager to provide this service.
- Chain replication can offer higher performance than most other replication protocols while also preserving linearizability. The `head` node is not as heavily taxed as Raft's leader, and since read operations go straight to the `tail` and never touch the `head`, there is no need to run read commands through the replication log as you must in Raft.
- Unlike Raft, which can carry on providing high performance even with a slow follower, Chain Replication's performance will suffer if any node in the chain runs slowly.
- Again unlike Raft, chain replication can be very easily sharded.
A typical deployment architecture would look something like this:
- All nodes connect to a configuration manager.
- Connectivity logically flows from a `head` node to a `tail` node via a chain of `replicas`.
- Clients connect to both the `head` and `tail` nodes, depending on the operation they are performing.
Protocol
Assumptions:
- The nodes are connected to a highly available configuration manager (for example, ZooKeeper), and all nodes are correctly configured in a chain.
- One node is set to be the `head`, another the `tail`, and a chain of intermediate nodes (`replicas`) sits between them.
- Writes are sequenced (i.e. each write is assigned a monotonically incrementing counter).
The write protocol operates as follows:
- The client connects to the `head` node and submits the write request.
- The `head` updates its internal state and forwards the update to the next `replica` in the chain.
- The process repeats until the `tail` node has received the update.
- Once the `tail` node processes the update, the `tail` replies to the client and the update is committed.
The read protocol operates as follows (a sketch of both the write and read paths follows this list):
- The client connects to the `tail` node and submits the read request.
- The `tail` node responds with the latest value.
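To make both paths concrete, here is a minimal single-process sketch of a three-node chain. Everything in it (the `Node` class, the in-memory stores, the global sequence counter) is invented for illustration; a real deployment would run each node as a separate process, with the configuration manager maintaining the chain layout.

```python
# Minimal single-process sketch of Chain Replication (illustrative only).

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}        # key -> (seq, value)
        self.next = None       # successor in the chain (None for the tail)

    def write(self, key, seq, value):
        # Apply locally, then forward the update towards the tail.
        self.store[key] = (seq, value)
        if self.next is not None:
            return self.next.write(key, seq, value)
        return ("ack", key, seq)           # the tail commits and replies

    def read(self, key):
        # Only ever called on the tail, so the value is committed.
        _, value = self.store[key]
        return value

# Build the chain: head -> replica -> tail
head, replica, tail = Node("head"), Node("replica"), Node("tail")
head.next, replica.next = replica, tail

seq = 0                                    # monotonic write sequence

def client_write(key, value):
    global seq
    seq += 1
    return head.write(key, seq, value)     # writes always enter at the head

def client_read(key):
    return tail.read(key)                  # reads are served by the tail

print(client_write("x", 42))               # ('ack', 'x', 1)
print(client_read("x"))                    # 42
```

Because the write only returns once the `tail` has applied it, any value the `tail` serves is, by construction, present on every node in the chain - which is what makes tail reads linearizable.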
Fail-over scenarios are simple to understand, and do not suffer the same complexity found in Raft:
- If the `head` node fails, the configuration manager will notice and take it out of service. The next `replica` in the chain is selected to be the new `head`, and the configuration of the chain is updated.
- If a `replica` node fails, the configuration manager will again notice and take it out of service. The chain is reconfigured to skip the out-of-service `replica`. After reconfiguration, the node on the right side (i.e. the tail side) of the gap will need to verify with the node to its left (the head side) which writes it is missing and apply them; the write sequence numbers can be used for this (see the catch-up sketch below).
- If the `tail` node fails, the configuration manager will once again notice and take it out of service. The chain is reconfigured so that the right-most `replica` (the one immediately before the `tail` in the chain) becomes the new `tail` node.
- In the case of node failure, all nodes are required to keep retrying messages further down the chain until they receive an updated configuration from the configuration manager.
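The catch-up step in the `replica` failure case could look something like the following sketch, assuming each node keeps its applied writes indexed by sequence number (the function names are hypothetical):

```python
# Sketch of replica catch-up after chain reconfiguration (names invented).
# Each node's log maps a write's sequence number to its payload.

def missing_writes(left_log, right_log):
    """Writes the left (head-side) node has that the right node lacks."""
    highest_seen = max(right_log, default=0)
    return {seq: val for seq, val in left_log.items() if seq > highest_seen}

def catch_up(left_log, right_log):
    # The right-side (tail-side) node pulls what it is missing, in
    # sequence order, before resuming normal chain traffic.
    for seq in sorted(missing_writes(left_log, right_log)):
        right_log[seq] = left_log[seq]

left = {1: "a", 2: "b", 3: "c"}   # head-side node has writes 1..3
right = {1: "a"}                  # tail-side node only saw write 1
catch_up(left, right)
print(right)                      # {1: 'a', 2: 'b', 3: 'c'}
```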
In terms of the clients:
- If the client submits an update to the `head`, and the `head` crashes before forwarding it, then the client will not receive an acknowledgement. The client must then retry the request.
- If the client submits an update to the `head`, and the `head` crashes after forwarding it, then the client will receive an acknowledgement from the `tail`. The client has nothing further to do.
- If the client submits an update to the `head`, and the `tail` crashes before acking, then the client will not receive an acknowledgement. The chain will be reconfigured and a new `tail` selected. The client must retry the request - but duplicate suppression (working from the client's sequence number) will lead to the retried write being ignored. (Open question: how does the client receive a write acknowledgement in this scenario? A retry-with-suppression sketch follows this list.)
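The duplicate suppression mentioned above could be implemented along these lines, assuming each client attaches its own monotonically increasing sequence number to every request (all names here are hypothetical):

```python
# Sketch of client retry with duplicate suppression (illustrative only).
# The head remembers the highest sequence number applied per client and
# ignores any request it has already seen.

applied = {}   # client_id -> highest applied client sequence number
store = {}     # the head's local key/value state

def head_write(client_id, client_seq, key, value):
    if applied.get(client_id, 0) >= client_seq:
        return "duplicate-ignored"    # a retry of an already-applied write
    applied[client_id] = client_seq
    store[key] = value                # ...then forward down the chain
    return "applied"

print(head_write("c1", 1, "x", 42))   # 'applied'
print(head_write("c1", 1, "x", 42))   # 'duplicate-ignored' (client retry)
```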
CRAQ
Chain Replication provides good performance, but the tail can become a hotspot. CRAQ (Chain Replication with Apportioned Queries) seeks to improve read throughput by allowing reads from any node but the tail. Much like Chain Replication, CRAQ requires an external configuration manager to monitor node health and layout.
The CRAQ protocol differs from Chain Replication by introducing `dirty` and `clean` states atop versioned objects, and by having the `head` node respond to writes only after the write acknowledgement has made its way from the `tail` back to the `head`.
The write protocol operates as follows:
- Each node keeps a list of versions (with the version number monotonically increasing) of each object; each version can be `clean` or `dirty`.
- When the `head` node receives a write request, it adds the new version to its local version list for the key, marks it `dirty`, and forwards the update down the chain towards the `tail`.
- Once the new version reaches the `tail`, the `tail` accepts the new version and marks it `clean`. The version is now committed, and the `tail` sends an ACK back up the chain towards the `head`.
- As the ACK moves from the `tail` to the `head`, each node marks its local version `clean`.
The read protocol operates as follows (note the `tail` cannot be read from; a sketch of both paths follows this list):
- If the latest version of the requested object is `clean`, it is returned immediately.
- If the latest version of the requested object is `dirty`, the node will ask the `tail` which version is the latest committed one, and then return that version.
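A minimal single-process sketch of the CRAQ write and read paths, assuming synchronous in-memory calls; `CraqNode` and its fields are invented for illustration, and the tail's acknowledgement is modelled as an immediate call back up the chain:

```python
# Minimal single-process sketch of CRAQ (all names are illustrative).

class CraqNode:
    def __init__(self, name):
        self.name = name
        self.versions = {}     # key -> {version: value}
        self.clean = {}        # key -> highest committed (clean) version
        self.next = None       # successor in the chain (None for the tail)
        self.prev = None       # predecessor (None for the head)
        self.tail = None       # reference to the tail, for version queries

    def write(self, key, version, value):
        # A new version is dirty until the tail's ack comes back.
        self.versions.setdefault(key, {})[version] = value
        if self.next is not None:
            self.next.write(key, version, value)
        else:
            self.ack(key, version)             # the tail commits immediately

    def ack(self, key, version):
        # The ack flows tail -> head, marking the version clean on each node.
        self.clean[key] = version
        if self.prev is not None:
            self.prev.ack(key, version)

    def read(self, key):
        latest = max(self.versions[key])
        if self.clean.get(key) == latest:
            return self.versions[key][latest]  # clean: answer locally
        committed = self.tail.clean[key]       # dirty: ask the tail which
        return self.versions[key][committed]   # version is committed

# Chain: head -> mid -> tail
head, mid, tail = CraqNode("head"), CraqNode("mid"), CraqNode("tail")
head.next, mid.next = mid, tail
mid.prev, tail.prev = head, mid
for n in (head, mid, tail):
    n.tail = tail

head.write("x", 1, "a")
print(mid.read("x"))          # 'a' -- version 1 is clean, served locally
mid.versions["x"][2] = "b"    # simulate an in-flight (dirty) write at mid
print(mid.read("x"))          # 'a' -- latest is dirty, so mid returns the
                              # tail's committed version instead
```

The last two lines reproduce the situation described just below: a node holding a `dirty` version never serves it, and instead falls back to whatever the `tail` has committed.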
Note that race conditions can happen with CRAQ, for example:
- Object A gets written to the `head` and is versioned as version 3.
- The `head` submits the write to `replica A`, which records version 3 and marks it `dirty`.
- A client requests object A from `replica A`. `Replica A` requests the committed version from the `tail` and returns version 2 to the client.
- The `tail` commits version 3, which is now `clean`.
This behaviour is, however, generally acceptable since CRAQ ensures that reads are monotonic across the whole chain. In other words, a read can never return an earlier version than one previously observed - only the same or a newer version - regardless of which nodes the read requests went to.
The CRAQ paper suggests an improvement that could be added: using multicast to update nodes on write requests.
In this case the protocol differs slightly - the changes are multicast to the nodes, and then a small `metadata` message is sent down the chain to ensure that every node is aware of the update.
If a node has no knowledge of the update (i.e. the multicast message was not delivered to it), the node can request the data from its predecessor in the chain and then continue passing the `metadata` message down the chain.
Once the `metadata` message has reached the `tail`, a multicast ACK is sent by the `tail` to commit the update. A sketch of this variant follows.
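The following sketch models this multicast variant in a single process, with the multicast treated as best-effort (lossy) delivery; the loss model and every name in it are assumptions for illustration:

```python
# Sketch of CRAQ's multicast optimisation (illustrative, lossy multicast).
import random

class McNode:
    def __init__(self, name):
        self.name = name
        self.pending = {}      # key -> (version, value), not yet committed
        self.committed = {}    # key -> (version, value)
        self.next = None
        self.prev = None

def multicast(nodes, key, version, value, loss_rate=0.3):
    # Best-effort delivery: some nodes may miss the payload entirely.
    for node in nodes:
        if random.random() > loss_rate:
            node.pending[key] = (version, value)

def send_metadata(node, key, version):
    # The metadata message walks the chain; a node that missed the
    # multicast fetches the payload from its predecessor.
    if node.pending.get(key, (None,))[0] != version:
        node.pending[key] = node.prev.pending[key]
    if node.next is not None:
        send_metadata(node.next, key, version)

def commit(nodes, key):
    # The tail's ACK is itself multicast: every node commits the update.
    for node in nodes:
        node.committed[key] = node.pending[key]

nodes = [McNode(n) for n in ("head", "mid", "tail")]
for a, b in zip(nodes, nodes[1:]):
    a.next, b.prev = b, a

nodes[0].pending["x"] = (1, "a")           # the head has the payload
multicast(nodes[1:], "x", 1, "a")          # best-effort to the others
send_metadata(nodes[1], "x", 1)            # metadata walks the chain
commit(nodes, "x")                         # tail's multicast ACK commits
print([n.committed["x"] for n in nodes])   # all nodes hold (1, 'a')
```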
Hermes
Hermes evolves the approach of CRAQ and multicast to deliver further improvements to both read and write throughput. One of the biggest differences is that in Hermes the `head` and `tail` roles are collapsed into a single node called the `coordinator`. To improve throughput, Hermes allows parallel writes to different keys, although writes to the same key are performed sequentially.
In the paper, the authors show Hermes responding to 400 million requests per second at ~45µs at the 99th percentile with a 95% read to 5% write ratio. This compares with CRAQ at 150µs at the 99th percentile at the same 400 million requests per second and the same 95%/5% read-to-write ratio. Note that testing for both CRAQ and Hermes was performed using Mellanox RDMA hardware.
Notes:
- An external membership service is responsible for node health and layout, much like in CRAQ and Chain Replication.
- Each versioned object can be in one of the following states: `Valid`, `Write`, `Invalid`, `Replay` and `Trans` (as in transient).
- Each object is versioned with Lamport clocks (see the timestamp sketch below).
- Hermes also supports Read-Modify-Write (RMW) operations; this is not discussed here - please refer to the paper for more detail. The RMW support allows Hermes to implement Replicated State Machines.
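The write timestamps used below pair a Lamport clock with the node id as a tie-breaker. A tuple-style representation such as the following (an illustrative sketch, not the paper's wire encoding) gives a total order over timestamps:

```python
# Sketch of a Hermes-style write timestamp: a Lamport clock paired with
# a node id to break ties (this representation is an assumption).
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Timestamp:
    clock: int     # Lamport clock, incremented on each local write
    node_id: int   # tie-breaker, so timestamps are totally ordered

def next_write_ts(local: Timestamp) -> Timestamp:
    # On a write, bump the clock and stamp it with this node's id.
    return Timestamp(local.clock + 1, local.node_id)

a = Timestamp(clock=3, node_id=1)
b = Timestamp(clock=3, node_id=2)
print(b > a)               # True: equal clocks, the higher node id wins
print(next_write_ts(a))    # Timestamp(clock=4, node_id=1)
```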
The Hermes protocol operates as follows for writes:
- When writing, the client submits the write to any node.
- The receiving node assumes the role of `coordinator` and confirms that the key to be written is currently in the `Valid` state. If the key is not in the `Valid` state, the write is stalled until it is. If it is, the node increments the Lamport clock for the key's version and appends its node id, assigning the result as the timestamp for the write.
- The `coordinator` node then broadcasts an `invalidation` message to the other replicas. This contains the key, the new value and the timestamp. The `coordinator` node then sets the key state to `Write`.
- The `follower` nodes then receive the `invalidation` message from the `coordinator`. Each `follower` checks the received timestamp against its local timestamp for the key. If the received timestamp is higher than the local timestamp:
  - the key is moved to the `Invalid` state, or to the `Trans` state if it was in the `Write` or `Replay` state
  - the `follower` then updates the key to the new value and updates the local timestamp to the received timestamp
- The `follower` then sends back an `ACK` carrying the timestamp received in the `invalidation` message.
- Once the `coordinator` receives `ACK` messages from all live nodes (as defined by the membership service), it transitions the key to the `Valid` state, unless the key was in the `Trans` state, in which case it is set to the `Invalid` state.
- If the `coordinator` transitioned the key to the `Valid` state, it broadcasts a `validated` message to the `followers`. This message includes the current timestamp of the key.
- The `followers` then receive the `validated` message and transition the key to `Valid` if the timestamp matches their local copy. If the timestamp does not match, the message is ignored.
The Hermes read protocol is as follows (a combined sketch of the write and read paths follows this list):
- The client submits the read request to any node.
- The node responds immediately if the key state is `Valid`; otherwise the read is stalled until the key becomes `Valid`.
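Pulling the write and read paths together, here is a minimal single-process sketch. The class name, handler names and the synchronous broadcast are all illustrative assumptions; the real system exchanges these messages over RDMA and also handles the `Trans` and `Replay` states, which are omitted here for brevity:

```python
# Minimal single-process sketch of the Hermes write and read paths.
VALID, WRITE, INVALID = "Valid", "Write", "Invalid"   # Trans/Replay omitted

class HermesNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers              # all nodes, shared list
        self.kv = {}                    # key -> [value, (clock, id), state]

    def coordinator_write(self, key, value):
        # Any node can coordinate a write for a key in the Valid state.
        _, ts, state = self.kv.get(key, [None, (0, 0), VALID])
        assert state == VALID, "real Hermes stalls here until Valid"
        new_ts = (ts[0] + 1, self.node_id)       # Lamport clock + node id
        self.kv[key] = [value, new_ts, WRITE]
        # Broadcast invalidations and gather ACKs from all live followers.
        acks = [p.on_invalidation(key, value, new_ts)
                for p in self.peers if p is not self]
        if all(ack == new_ts for ack in acks):
            self.kv[key][2] = VALID              # commit locally...
            for p in self.peers:                 # ...and broadcast 'validated'
                if p is not self:
                    p.on_validated(key, new_ts)

    def on_invalidation(self, key, value, ts):
        _, local_ts, _ = self.kv.get(key, [None, (0, 0), VALID])
        if ts > local_ts:                        # higher timestamp wins
            self.kv[key] = [value, ts, INVALID]
        return ts                                # the ACK echoes the timestamp

    def on_validated(self, key, ts):
        entry = self.kv.get(key)
        if entry and entry[1] == ts:             # stale messages are ignored
            entry[2] = VALID

    def read(self, key):
        value, _, state = self.kv[key]
        assert state == VALID, "real Hermes stalls the read until Valid"
        return value                             # local read, no network hop

peers = []
nodes = [HermesNode(i, peers) for i in (1, 2, 3)]
peers.extend(nodes)

nodes[0].coordinator_write("x", "hello")   # any node may coordinate
print(nodes[2].read("x"))                  # 'hello' -- served locally
```

Note how a read never leaves the local node: this is where Hermes gains over both Chain Replication (tail-only reads) and CRAQ (a `dirty` read costs a round trip to the tail).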