It is easier to understand Paxos in the context of other solutions that try to solve the consensus problem but have shortcomings, so let’s talk about those first.
An intuitive way of reaching consensus is to take marriage vows:
“Do you …?” “I do!” “I do!”
“I now pronounce you …”
Assume now that the marriage is not between two people, but is a bit like what happens among the Aiel folk in Robert Jordan’s Wheel of Time series — one or more Aiel women can be first-sisters, and either the man has to marry them all, or none at all. Among the Aiel, the marriage vows would perhaps work as follows:
“Do you …?” “I do!” “I do!” “I do!” …
“I now pronounce you …”
If any of the Aiel would-be spouses does not respond with an “I do!”, the wedding cannot proceed.
Computer scientists call this the two-phase commit.
Two Phase Commit (2PC)
- Voting Phase — A coordinator suggests a value to all nodes and gathers their responses (whether they agree to the value or not). In our scenario, a transaction coordinator asks whether all resource managers (RMs, the database server instances) can commit to a transaction. The RMs reply with a yes or no.
- Commit Phase — If everyone agrees, the coordinator contacts all nodes to let them know the value is final. If even one node does not agree, inform all nodes that the value is not final. In our scenario, the coordinator asks the RMs to commit the transaction or abort it.
Note that the vote is only on the proposed value — a node can only say yes or no. It cannot suggest an alternative. If a node wants to suggest a value, it should start its own 2PC. Clearly the algorithm works — the nodes decide on a value proposed by one of the nodes. It is also reasonably efficient — for N nodes, 3N messages are exchanged.
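The two phases above can be sketched in a few lines of Python. This is a minimal single-process sketch: message passing is modeled as direct method calls, and the `Node` class and its fields are illustrative, not part of any real 2PC implementation.

```python
# A minimal single-process sketch of 2PC. Message passing is modeled as
# direct method calls; Node and its fields are illustrative.

class Node:
    def __init__(self, will_agree=True):
        self.will_agree = will_agree
        self.state = "idle"

    def prepare(self, value):          # voting phase: vote yes or no
        self.state = "voted"
        return self.will_agree

    def finish(self, decision):        # commit phase: learn the outcome
        self.state = decision

def two_phase_commit(nodes, value):
    # Voting phase: the coordinator proposes a value to every node.
    votes = [node.prepare(value) for node in nodes]
    # Commit phase: commit only if every single node voted yes.
    decision = "commit" if all(votes) else "abort"
    for node in nodes:
        node.finish(decision)
    return decision

print(two_phase_commit([Node(), Node()], "txn"))                  # commit
print(two_phase_commit([Node(), Node(will_agree=False)], "txn"))  # abort
```

Note the message count: one proposal, one vote, and one decision per node, i.e. the 3N messages mentioned above.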
But what happens if a node crashes? For example, assume that the coordinator crashes in phase 1 after the proposal has been sent to some of the N nodes, but not all.
- Now some nodes have started a 2PC round, while some nodes are unaware of a 2PC round happening. The ones who have started the 2PC round are blocked waiting for the next phase.
- In our scenario, an RM that has voted yes may also have locked some resources, so it cannot simply time out and release them: the coordinator may yet recover and start phase 2.
Similar problems exist if the coordinator crashes in phase 2 after it has sent the commit message to some nodes but not all. Some of these problems can be solved by having another node take over coordination duties on observing a timeout. This node can contact all the other nodes to discover their votes (which requires nodes to persist their votes) and take the transaction to completion, but further participant failures can occur during this process, and the transaction may never recover. Bottom line: 2PC cannot operate reliably in the event of node failure.
Three Phase Commit (3PC)
The key issue with 2PC is that in case the coordinator crashes, there is no one else who has the knowledge to complete the protocol. This can be solved by the addition of an extra step:
Phase 1 — same as before — a coordinator suggests a value to all nodes
Phase 2 — new step — on receiving a yes from all nodes in previous step, the coordinator sends a “prepare to commit” message. The expectation is that nodes can perform work that they can undo, but nothing which cannot be undone. Each node acknowledges to the coordinator that it has received a “prepare to commit” message.
Phase 3 — similar to the second phase in 2PC — if the coordinator receives an acknowledgement from every node for the “prepare to commit” message, it communicates the result of the vote to all nodes, asking them to commit. However, if any node fails to acknowledge, the coordinator aborts the transaction.
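Under the same simplifying assumptions as before (in-process calls instead of messages, illustrative names), the three phases might be sketched as:

```python
# A minimal single-process sketch of the three 3PC phases; message passing
# is modeled as direct method calls and all names are illustrative.

class RM:
    def __init__(self, agrees=True):
        self.agrees = agrees
        self.state = "idle"

    def vote(self, value):             # phase 1: vote yes or no
        self.state = "voted"
        return self.agrees

    def prepare_to_commit(self):       # phase 2: do only undoable work
        self.state = "prepared"
        return True                    # acknowledge

    def finish(self, outcome):         # phase 3: commit or abort
        self.state = outcome

def three_phase_commit(rms, value):
    # Phase 1: voting, exactly as in 2PC.
    if not all(rm.vote(value) for rm in rms):
        outcome = "abort"
    # Phase 2: "prepare to commit"; every RM must acknowledge.
    elif not all(rm.prepare_to_commit() for rm in rms):
        outcome = "abort"
    # Phase 3: the actual commit, as in the second phase of 2PC.
    else:
        outcome = "commit"
    for rm in rms:
        rm.finish(outcome)
    return outcome

print(three_phase_commit([RM(), RM(), RM()], "txn"))        # commit
print(three_phase_commit([RM(), RM(agrees=False)], "txn"))  # abort
```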
Now if the coordinator crashes at any point, any participant can take over the role and query the state from other nodes.
- If any RM reports to the recovery node that it has not received the “prepare to commit” message, the recovery node knows that the transaction has not been committed at any RM. Now either the transaction can be pessimistically aborted, or the protocol instance can be re-run.
- If an RM that has committed the transaction crashes, we know that every other RM must have received and acknowledged the “prepare to commit” message, since otherwise the coordinator would not have moved to the commit phase. So the recovery node can safely proceed with the last phase.
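The recovery rule just described can be sketched as a small decision function; the state names are hypothetical labels for where each RM was when the coordinator crashed.

```python
# A sketch of the 3PC recovery rule: the recovery node queries each RM's
# last known state and decides. State names are hypothetical labels.

def recover(states):
    reachable = [s for s in states if s != "crashed"]
    if any(s == "voted" for s in reachable):
        # Some RM never received "prepare to commit", so no RM can have
        # committed: it is safe to abort (or to re-run the protocol).
        return "abort"
    # Every reachable RM is at least prepared; the coordinator must have
    # finished phase 2, so it is safe to commit.
    return "commit"

print(recover(["prepared", "voted", "prepared"]))     # abort
print(recover(["prepared", "crashed", "committed"]))  # commit
```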
So 3PC works well despite node failures. This is at the cost of adding one more step across N nodes which results in higher latencies.
Where 3PC falls short is in the event of a network partition. Assume that all RMs that received “prepare to commit” are on one side of the partition, and the rest are on the other side. Each partition will then elect its own recovery node, and one side may commit the transaction while the other aborts it. Once the partition heals, the system is in an inconsistent state.
Paxos — Why Bother?
First things first: given 3PC, do we even need something better? The only problem is network partition, right? To begin with, let’s assume that network partition is the only problem (it is not, as we shall soon see). Is correctness in the event of a network partition a problem worth solving? Today, with cloud computing and internet-scale services where nodes may be on different sides of a continent or across oceans, we certainly need a partition-tolerant algorithm.
The second point is that network partition is not the only problem. While we tackled the case of a node failing permanently, the more general case is that a node crashes, then recovers and resumes from where it left off. This fail-recover model can also describe an asynchronous network, where there is no upper bound on the time a node can take to respond to a message: you can never assume a node is dead, since it may just be slow, or the network may be slow. You cannot rely on timeouts in this model.
3PC is fail-stop resilient, but not fail-recover resilient. Unfortunately real life requires fail-recover and hence we need a more general solution. This is where Paxos comes in.
Paxos — How it Works
The basic steps in Paxos are very similar to 2PC:
- Elect a node to be a Leader / Proposer
- The Leader selects a value and sends it to all nodes (called Acceptors in Paxos) in an accept-request message. Acceptors can reply with reject or accept.
- Once a majority of the nodes have accepted, consensus is reached and the coordinator broadcasts a commit message to all nodes.
The key difference from 2PC is that only a majority of nodes needs to agree, not all of them. This is an interesting idea because any two majorities must share at least one common node. This ensures that if a majority has agreed upon a certain value in a given round, any node trying to propose a value subsequently would learn that value from other nodes and would agree only to that value. It also means that Paxos does not block even if up to half the nodes fail to reply.
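The overlap property behind this argument is easy to verify exhaustively for a small cluster:

```python
# Every two majority quorums of a cluster share at least one node.
# Checked exhaustively here for a 5-node cluster.
from itertools import combinations

nodes = range(5)
majority = len(nodes) // 2 + 1        # 3 of 5

quorums = list(combinations(nodes, majority))
assert all(set(a) & set(b) for a in quorums for b in quorums)
print("every pair of majorities intersects")
```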
Of course what can also happen is that the Leader itself may fail. To handle this, Paxos does not mandate a single Leader at a given point. It allows that any node may make itself a Leader and try to coordinate the transaction. This naturally means that at a given point in time more than one node may exist that believes itself to be the Leader. In such a case it is likely that the two Leaders may be proposing different values. So how is consensus reached? To achieve consensus in this setup, Paxos introduces two mechanisms:
- Assigning an order to the Leaders. This allows each node to distinguish between the current Leader and the older Leader, which prevents an older Leader (which may have recovered from failure) from disrupting consensus once it is reached.
- Restricting a Leader’s choice in selecting a value. Once consensus has been reached on a value, Paxos forces future Leaders to select the same value to ensure that consensus continues. This is achieved by having acceptors send the most recent value they have agreed to, along with the sequence number of the Leader from whom it was received. The new Leader can choose from one of the values received from the Acceptors, and in case no one sends any value, the Leader can choose its own value.
1) Prepare Phase:
- A node chooses to become the Leader and selects a sequence number x and value v to create a proposal P1(x, v). It sends this proposal to the acceptors and waits till a majority responds.
- An Acceptor on receiving the proposal P1(x, v1) does the following:
- If this is the first proposal to which the Acceptor is going to agree, reply ‘agree’ — this is now a promise that the Acceptor will reject all future proposals with a sequence number lower than x
- If there are already proposals to which the Acceptor has agreed:
- compare x to the highest seq number proposal it has already agreed to, say P2(y, v2)
- If x < y, reply ‘reject’ along with y
- If x > y, reply ‘agree’ along with P2(y, v2)
2) Accept Phase
- If a majority of Acceptors fail to reply or reply ‘reject’, the Leader abandons the proposal and may start again
- If a majority of Acceptors reply ‘agree’, the Leader will also receive the values of any proposals they have already accepted. The Leader picks any of these values (or, if no values have been accepted yet, uses its own) and sends an ‘accept-request’ message with the proposal number and value.
- When an Acceptor receives an ‘accept-request’ message, it replies ‘accept’ only if both of the following conditions are met; otherwise it replies ‘reject’:
- The value is the same as the value of one of the previously accepted proposals
- The sequence number is the highest proposal number the Acceptor has agreed to
- If the Leader does not receive an ‘accept’ from a majority, it abandons the proposal and starts again. However, if the Leader does receive an ‘accept’ from a majority, the protocol can be considered terminated. As an optimization, the Leader may send ‘commit’ to the other nodes.
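Putting the two phases together, here is a toy single-process run of Paxos rounds over in-memory Acceptors, following the rules above. All names are illustrative, and a real implementation must persist promises and exchange actual messages.

```python
# A toy single-process run of Paxos rounds over in-memory Acceptors,
# following the prepare/accept rules above. All names are illustrative;
# a real implementation must persist state and exchange real messages.

class Acceptor:
    def __init__(self):
        self.promised = -1        # highest sequence number promised
        self.accepted = None      # (seq, value) last accepted, if any

    def prepare(self, x):
        if x < self.promised:
            return ("reject", self.promised)
        self.promised = x         # promise to reject proposals below x
        return ("agree", self.accepted)

    def accept(self, x, v):
        if x >= self.promised:    # x is still the highest promised number
            self.promised = x
            self.accepted = (x, v)
            return "accept"
        return "reject"

def run_round(acceptors, seq, value):
    majority = len(acceptors) // 2 + 1
    # Prepare phase: gather promises from a majority.
    replies = [a.prepare(seq) for a in acceptors]
    agrees = [r for r in replies if r[0] == "agree"]
    if len(agrees) < majority:
        return None               # abandon; retry with a higher seq number
    # If any Acceptor already accepted a value, the Leader must re-propose
    # the highest-numbered one; otherwise it is free to use its own value.
    prior = [accepted for _, accepted in agrees if accepted is not None]
    chosen = max(prior)[1] if prior else value
    # Accept phase: consensus is reached once a majority accepts.
    accepts = sum(a.accept(seq, chosen) == "accept" for a in acceptors)
    return chosen if accepts >= majority else None

acceptors = [Acceptor() for _ in range(5)]
print(run_round(acceptors, 1, "A"))   # 'A' is chosen
print(run_round(acceptors, 2, "B"))   # a later Leader learns 'A', not 'B'
print(run_round(acceptors, 1, "C"))   # a stale Leader is rejected: None
```

The second and third calls show the two mechanisms from the previous section in action: a later Leader is forced to re-propose the already-chosen value, and an older Leader with a lower sequence number cannot disrupt the consensus.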
Paxos Failure Handling
What happens if we mandate only one Leader at a time in Paxos, and also mandate that all nodes must vote instead of just a majority? You are right — we get 2PC. 2PC is a specific case of Paxos.
As one can see, Paxos is more failure tolerant than 2PC:
- Leader fails — another Leader can take over the protocol by issuing its own proposal.
- Original Leader recovers — two Leaders can co-exist thanks to the rules on agreeing only to higher numbered proposals and committing only previously accepted values.
Paxos is also more tolerant than 3PC. Specifically, Paxos is partition tolerant unlike 3PC. In 3PC, if two partitions separately agree on a value, when the partition merges back you are left in an inconsistent state. In Paxos, this does not arise because of the majority condition. Unless a partition has majority, it cannot come to a consensus. And if a partition has majority and comes to a consensus, that value would need to be accepted by the nodes in other partitions.
One issue with Paxos is that two Leaders, unable to observe each other because of partitioning, may try to out-bid one another by issuing proposals that have higher seq number than the previous proposal. This can lead to a situation where Paxos may not terminate. Eventually the two Leaders are expected to observe one another and one of them needs to back off.
This is a trade-off between safety and liveness. Paxos is a safe algorithm — once consensus is reached, the agreed value is not changed. However, Paxos is not guaranteed to be live — it may not terminate in some rare cases. In fact an asynchronous consensus algorithm cannot be guaranteed to be both safe and live. This is called the FLP Impossibility Result.
Further Reading
- Principles of Transaction Processing, Chapter 8 provides a detailed overview of Two Phase Commit
- Non-blocking Commit Protocols — original paper by Dale Skeen that describes 3PC
- The Part-time Parliament — the original Paxos paper by Lamport. It uses a parliament analogy which people found hard to get past when the paper was originally published.
- Paxos Made Simple — Lamport’s rewrite without the parliament analogy. While simpler, it is easy to miss the forest for the trees; it requires multiple readings to grok.
- Paxos Made Live — Google’s description of their Paxos implementation. The most readable of the Paxos papers.