Primary-backup replication is one of many approaches to building a fail-stop fault-tolerant replicated state machine. Four approaches to defining a primary-backup protocol are discussed below:
- Alsberg and Day (1976)
- Tandem Global Update Protocol (~1985)
- Budhiraja, Marzullo, Schneider, and Toueg (1993)
- Scales, Nelson and Venkitachalam (2010)
Alsberg and Day
The first is the protocol defined by Alsberg and Day in 1976. The protocol they described operates as follows:
- A client sequences and sends an update request to the primary server. The client is now blocked.
- The primary accepts the update request, performs the update, and then forwards the update to the backup server as a cooperate request.
- The backup server then accepts the cooperate request and performs the update. It then issues a backup request to a third server, sends an update acknowledgment back to the client, and finally sends a cooperate acknowledgment back to the primary.
- The client then unblocks.
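The core of the message flow above can be sketched as a synchronous in-process model. This is a minimal illustration, not the paper's specification: the class and method names are my own, and the third-server backup request and client acknowledgment are elided to comments.

```python
# Minimal sketch of the Alsberg and Day update flow.
# The in-process method calls stand in for network messages.

class Backup:
    def __init__(self):
        self.state = {}

    def cooperate(self, key, value):
        # The backup applies the update; in the full protocol it would
        # also forward a backup request to a third server and send an
        # update acknowledgment to the client before acking the primary.
        self.state[key] = value
        return "cooperate-ack"

class Primary:
    def __init__(self, backup):
        self.state = {}
        self.backup = backup

    def update(self, key, value):
        # The primary applies the update locally first...
        self.state[key] = value
        # ...then forwards it to the backup as a cooperate request.
        return self.backup.cooperate(key, value)

backup = Backup()
primary = Primary(backup)
ack = primary.update("x", 1)           # the client blocks until this returns
assert primary.state == backup.state   # both replicas now hold the update
```

The call into `cooperate` returning before `update` returns mirrors why the client stays blocked for the full round trip.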
Failure detection operates via two mechanisms: missing/lost acknowledgment messages and periodic 'are you alive' messages. If the primary fails, the backup takes over as the new primary. When the primary has no backup, there is an additional protocol to recruit a server as a new backup.
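The 'are you alive' mechanism can be sketched as a probe plus a timeout. The timeout value, the `are_you_alive` probe, and the tuple-returning detector are all assumptions for illustration; the paper does not prescribe this structure.

```python
# Hedged sketch of 'are you alive' failure detection.
# Timings and names are illustrative assumptions, not from the paper.

HEARTBEAT_TIMEOUT = 2.0  # seconds of silence before declaring failure

class Node:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def are_you_alive(self):
        # A fail-stop node simply never answers once it has failed.
        return self.alive

def detect_failure(peer, last_reply_at, now):
    """Return (failed, last_reply_at): failed if silence exceeded the timeout."""
    if peer.are_you_alive():
        return False, now                 # reply received: reset the timer
    return (now - last_reply_at) > HEARTBEAT_TIMEOUT, last_reply_at

crashed = Node("primary", alive=False)    # simulate a crashed primary
failed, _ = detect_failure(crashed, last_reply_at=0.0, now=5.0)
# failed: the backup would now take over as the new primary
```

On detection, the backup promotes itself; if the new primary then has no backup, the recruitment protocol mentioned above would run.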
Visually, the Alsberg and Day protocol operates as follows:
The Alsberg and Day paper does not answer:
- how it would operate with multiple clients (in the book Distributed Systems, the authors suggest that the primary blocks after receiving a client update request until it receives the cooperate acknowledgment, though I don't see this in the original paper; blocking forces all client requests to be processed sequentially)
- how we can know that update operations behave consistently across nodes - there is no mention that the update operation must be deterministic so that the nodes remain consistent
- how read operations might work
- how to safely manage network partitions - indeed, the protocol suffers from split brain. During a network partition, the two nodes will diverge, and the authors do not provide a viable solution.
Tandem Global Update Protocol
The second protocol is the Tandem Global Update Protocol (circa 1985). This paper introduced a Locker component that grants permission for a sender to broadcast a message. First, the sender contacts the Locker and requests permission to send; only once permission is granted does it broadcast to the other processes. Note that communications are blocked while awaiting permission from the Locker, so there is no need for sequence numbers.
The protocol is approximately as follows:
- Sender requests permission from Locker to broadcast
- Locker grants permission when not locked
- Sender broadcasts to destinations (in effect active/active primary servers)
- Destinations ack
- If destinations do not ack before timeout, sender resends
- Once all destinations have acknowledged, the Sender informs the Locker, and the Locker is then free to grant a new send.
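The Locker handshake above can be sketched in a few lines. This is an in-process caricature under stated assumptions: the `Locker`, `Destination`, and `broadcast` names are mine, and the retry-on-timeout path is only noted in a comment.

```python
# Sketch of a Tandem-style Locker-mediated broadcast.

class Locker:
    def __init__(self):
        self.locked = False

    def request_permission(self):
        if self.locked:
            return False          # another broadcast is in flight
        self.locked = True
        return True

    def release(self):
        self.locked = False

class Destination:
    def __init__(self):
        self.messages = []

    def deliver(self, msg):
        self.messages.append(msg)
        return "ack"              # each destination acknowledges delivery

def broadcast(locker, destinations, msg):
    if not locker.request_permission():
        return False              # sender must wait and retry
    # Broadcast to all destinations; a real sender would resend to any
    # destination that fails to ack before a timeout.
    acks = [d.deliver(msg) for d in destinations]
    if all(a == "ack" for a in acks):
        locker.release()          # lock is freed only once all have acked
        return True
    return False

locker = Locker()
dests = [Destination(), Destination()]
broadcast(locker, dests, "update-1")
```

Because at most one sender holds the lock, broadcasts are serialized globally, which is what removes the need for sequence numbers.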
Note that the Tandem Global Update Protocol makes use of broadcast messages from the sender to destinations.
Visually:
Budhiraja, Marzullo, Schneider, and Toueg
The third example comes from the book Distributed Systems 2nd Edition (1993). The authors describe a simple primary backup protocol that operates as follows:
- A client sends a request to the primary server
- The primary server accepts the request and performs the update locally
- The primary server forwards the updated state to the backup without awaiting a response
- The primary server responds to the client
- The backup server applies the updated state to itself
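The distinctive step above is that the primary forwards the updated state and replies to the client without waiting for the backup. A minimal sketch, with a queue standing in for the asynchronous network (all names are illustrative):

```python
# Sketch of the Budhiraja et al. flow: the primary applies the update,
# forwards the *updated state* (not the operation) without waiting,
# then replies to the client immediately.
from collections import deque

class Backup:
    def __init__(self):
        self.state = {}
        self.inbox = deque()      # state updates still in flight

    def drain(self):
        # The backup applies forwarded state whenever it arrives.
        while self.inbox:
            self.state = self.inbox.popleft()

class Primary:
    def __init__(self, backup):
        self.state = {}
        self.backup = backup

    def handle(self, key, value):
        self.state[key] = value                      # apply locally
        self.backup.inbox.append(dict(self.state))   # forward, no wait
        return "ok"                                  # reply to client now

backup = Backup()
primary = Primary(backup)
primary.handle("x", 1)
# The client already has its reply, but the backup may still lag
# behind until it applies the forwarded state via drain().
```

The window between the client's reply and `drain()` is exactly where a primary failure can lose an acknowledged update, which is part of why the backup-failure and partition issues below matter.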
Visually, it operates as:
Behind the scenes, the protocol sends regular 'dummy' messages from the primary to backup server. Should the backup not get a message from the primary for a set period of time, then the backup will assume the primary has failed and inform the client that it is now primary.
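The backup-side detector can be sketched as a watchdog that any message, real or dummy, resets. The timeout constant and tick-based clock are assumptions for illustration:

```python
# Sketch of the dummy-message failure detector: the backup promotes
# itself if nothing arrives from the primary within the bound.

class BackupWatchdog:
    TIMEOUT = 3              # maximum silent interval, in ticks (assumed)

    def __init__(self):
        self.last_heard = 0
        self.is_primary = False

    def on_message(self, now):
        self.last_heard = now         # any message resets the timer

    def tick(self, now):
        if now - self.last_heard > self.TIMEOUT:
            self.is_primary = True    # assume the primary has failed

w = BackupWatchdog()
w.on_message(now=1)
w.tick(now=3)    # within the bound: still backup
w.tick(now=6)    # silence exceeded the bound: take over as primary
```

Note this leans on the authors' assumption of a bounded message delivery time; without that bound, silence cannot safely be read as failure.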
The book demonstrates how a simple primary backup protocol might work, but their protocol suffers from:
- no clarity on how to operate with multiple clients
- unhandled backup server failures
- no strategy for network partitions (the authors assume the network has no faulty links and that message delivery time has an upper bound)
- no treatment of read operations
Scales, Nelson and Venkitachalam
The final example comes from 2010 - Scales, Nelson and Venkitachalam described an approach for replicating single core virtual machines as replicated state machines. The approach solves the split brain problem, and is especially interesting to me as it has a solution for non-deterministic commands that works in their specific environment. In this case:
- both primary and backup virtual machine managers connect to an external service to determine the leader, thus solving split brain errors
- when the primary virtual machine manager receives an input (a network event or some other interrupt), the request is ordered by the virtual machine manager and sent into the primary virtual machine for processing. The same input is copied to the backup virtual machine manager via a log (which in their paper is a network pipe, not a file as in Raft), which in turn submits it to the backup virtual machine
- if the primary virtual machine manager notices that a non-deterministic action is requested in the CPU, it will capture the state (i.e. output data) of that operation and send it via the log to the backup virtual machine manager
- if the backup virtual machine manager notices that a non-deterministic action is requested in the CPU, it stops processing and awaits the log input from the primary virtual machine manager which will include the state of the non-deterministic operation. This state is then applied to the backup virtual machine, and processing continues.
- both the primary and backup virtual machines then execute and optionally prepare and send responses. The client receives a response only from the primary virtual machine as the backup virtual machine manager knows that the backup virtual machine is not primary, and just swallows the response.
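The non-deterministic-operation handling above can be sketched with a shared log: the primary executes the operation and ships the result, while the backup replays the logged result instead of executing. The `read_clock` naming and the random-number stand-in are illustrative, not from the paper.

```python
# Sketch of log-based replay for non-deterministic operations.
import random
from collections import deque

log = deque()   # stands in for the network pipe between VM managers

def primary_read_clock():
    value = random.random()           # non-deterministic on the primary
    log.append(("nd-result", value))  # ship the *result* via the log
    return value

def backup_read_clock():
    # The backup stalls until the primary's result arrives, then applies
    # it rather than executing the non-deterministic operation itself.
    tag, value = log.popleft()
    assert tag == "nd-result"
    return value

p = primary_read_clock()
b = backup_read_clock()
# Both virtual machines observe the identical value, keeping the
# replicated state machines in lockstep.
```

This is why the approach works despite non-determinism: only the primary ever actually performs the operation, and the backup deterministically consumes its output.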
Visually, the Scales, Nelson and Venkitachalam protocol operates as follows (showing input and output operations only):
Note that their approach could only ever work for a single-core virtual machine. My current understanding is that VMware deploys state replication (i.e. memory-level synchronization) approaches for multicore virtual machine replication. A similar approach is used for ultra-low latency service replication (reflective direct memory access technology that allows ~8 nodes to be synchronized in under 1 µs).