Introduction
Paxos is a distributed consensus algorithm developed by Leslie Lamport. It has been proven correct, and many systems, such as Chubby and ZooKeeper, are built on it or on close variants of it.
This article, however, is not going to discuss Lamport's original paper; it focuses on the engineering implementation. My colleague highly recommended Ongaro's lecture on Paxos as the best source for learning it, and after studying it I cannot agree more. With that said, I will briefly talk about Paxos and Multi-Paxos and then dive into the engineering implementation.
Multi-Paxos & Paxos
In short, Multi-Paxos is simply the repeated use of Paxos. Paxos was published by Lamport in 1998 as a solution to the distributed consensus problem. There are three roles in Paxos: proposer, acceptor, and learner. They can run on the same server or on separate servers depending on the design. In Ongaro's lecture, the learner is folded into the acceptor, so the protocol ends up with two roles, proposer and acceptor (in Lamport's Paxos Made Simple, a server can be proposer, acceptor, and learner at the same time). To choose a value, the proposer uses a 2-phase protocol to lock down an entry in the acceptors' logs.
With Paxos, we can meet the requirements of safety and liveness:
- Safety: nothing bad ever happens - at most one value can be chosen, and a server never learns a value unless it has actually been chosen
- Liveness: something good eventually happens - some proposed value is eventually chosen
- To avoid livelock, we can use exponential backoff or, in Multi-Paxos, leader election

Only the proposer knows which value has been chosen; other proposers must run Paxos with their own proposals to learn the chosen value.
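To make the 2-phase protocol concrete, below is a minimal single-decree sketch in Python; the class and field names are my own illustration, not code from the lecture or the linked repo. The proposer prepares with a proposal number, adopts any value an acceptor has already accepted, and then asks the acceptors to accept.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Acceptor:
    min_proposal: int = 0               # highest proposal number seen in a Prepare
    accepted_proposal: int = 0          # proposal number of the accepted value (0 = none)
    accepted_value: Optional[str] = None

    def prepare(self, n: int):
        """Phase 1: promise to reject proposals below n; report any accepted value."""
        if n > self.min_proposal:
            self.min_proposal = n
        return self.accepted_proposal, self.accepted_value

    def accept(self, n: int, value: str) -> int:
        """Phase 2: accept the value unless a higher-numbered Prepare was seen."""
        if n >= self.min_proposal:
            self.min_proposal = n
            self.accepted_proposal = n
            self.accepted_value = value
        return self.min_proposal        # a reply greater than n means "rejected"


def propose(acceptors: List[Acceptor], n: int, value: str) -> str:
    majority = len(acceptors) // 2 + 1
    # Phase 1: send Prepare(n) to a majority and adopt the highest accepted value seen.
    replies = [a.prepare(n) for a in acceptors[:majority]]
    already_accepted = [(p, v) for p, v in replies if v is not None]
    if already_accepted:
        value = max(already_accepted)[1]    # must re-propose the existing value
    # Phase 2: send Accept(n, value); the value is chosen once a majority accepts it.
    acks = sum(1 for a in acceptors[:majority] if a.accept(n, value) <= n)
    if acks >= majority:
        return value
    raise RuntimeError("proposal rejected; retry with a higher proposal number")


# Three in-memory acceptors, one proposer choosing the value "jmp".
print(propose([Acceptor() for _ in range(3)], n=1, value="jmp"))
```

Multi-Paxos runs this exchange once per log entry, which is why the optimizations discussed below focus on electing a single proposer and skipping most of the prepare phase.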
As said in the beginning, Multi-Paxos is the repeated use of Paxos, which means we run Paxos to choose a value for a log entry every time a client makes a request. Ongaro points out several issues with Multi-Paxos and suggests solutions:
- Which log entry to use for a given client request?
  - When a `jmp` command is sent to server `s1`, `s1` needs to track the variable `firstUnchosenIndex`
    - If the entry at `firstUnchosenIndex` already has a value, `s1` sends the proposal using that existing value and increments `firstUnchosenIndex` once the value is chosen
    - `s1` repeats this until the entry at `firstUnchosenIndex` has no value, and only then proposes the `jmp` command
  - The server can handle client requests concurrently
    - Suppose we have multiple requests
    - We can try a different unchosen index for each independent request
    - So we need a collection of unchosen indexes
  - The application of log entries to the state machine must still be sequential
- Paxos is slow because of the 2-phase protocol, and several competing proposers can cause livelock
  - We should use leader election to ensure a single proposer at any given time (see the sketch after this list)
    - Bully algorithm: the server with the largest ID is the leader
      - Each server sends a heartbeat to the other servers every `T` ms
      - If a server has not received a heartbeat from a server with a higher ID in the last `2T` ms, it acts as the leader
        - Accepts requests from clients
        - Acts as both proposer and acceptor
      - The remaining servers are followers
        - Redirect client requests to the leader
        - Act as acceptors
  - We can issue the prepare phase once for the entire log, so a single global proposal number covers all log entries and blocks all older proposals
    - The prepare response then carries the `highestProposal` accepted for the current entry and a `noMoreAccepted` flag
      - `noMoreAccepted`: the acceptor has accepted no values after the current entry
    - If the leader receives `noMoreAccepted` from a majority of acceptors, it no longer needs to send prepare requests and can send accept requests only
- How to ensure full replication across servers?
  - Remaining issues:
    - Log entries are not fully replicated
    - Only the proposer knows when an entry is chosen
  - Solutions:
    - Keep sending accept requests until all acceptors respond
    - Mark chosen entries with an infinite proposal number so they can never be overwritten
      - Each server tracks `firstUnchosenIndex`
      - The proposer includes its `firstUnchosenIndex` in accept requests
      - The acceptor marks all entries as chosen if
        - the entries are before `firstUnchosenIndex`
        - and the entries carry the same proposal number as the request
    - The acceptor can consult the leader about entries whose status it does not know
      - The acceptor includes its own `firstUnchosenIndex` in accept replies
      - If the proposer's `firstUnchosenIndex` is larger, it sends a Success request to the acceptor to confirm the chosen value
        - and the acceptor keeps returning its `firstUnchosenIndex` until it catches up
- How to deal with fail-stop issues from the client's perspective?
  - The client could submit the same request multiple times, so we need an idempotency key associated with each request
- How to update the configuration of servers?
  - A configuration change impacts
    - the number of servers, which determines the majority (quorum size)
    - the $\alpha$ parameter
  - Save the configuration as normal log entries
    - the configuration is treated like any other command in the log
    - if a new configuration is stored at entry $i$, it takes effect when choosing entry $i + \alpha$ and beyond (e.g. with $\alpha = 3$, a configuration chosen at entry 5 governs the choice of entries 8 onward)
      - every time the leader chooses entry $i + \alpha$, it checks entry $i$ for a configuration command
      - if there is one, the configuration is updated
      - otherwise, the existing configuration is used
    - if $\alpha$ is too small, the system winds up behaving synchronously, since at most $\alpha$ entries can be chosen concurrently
    - if $\alpha$ is too large, a configuration change takes a long time to take effect
    - we can use no-op commands to quickly fill the log entries between $i$ and $i + \alpha$ so the new configuration takes effect sooner
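Below is a minimal Python sketch of the heartbeat-based bully election described in the list above; the names and timing constants are made up for illustration and are not from the linked repo. Each server records heartbeats only from servers with a higher ID and claims leadership once it has heard nothing from them for `2T` ms.

```python
import time

HEARTBEAT_INTERVAL = 0.1                   # T: heartbeats are sent every 100 ms
LEADER_TIMEOUT = 2 * HEARTBEAT_INTERVAL    # 2T: silence window before claiming leadership


class Server:
    def __init__(self, server_id: int):
        self.server_id = server_id
        self.last_higher_heartbeat = 0.0   # last time a higher-ID server was heard from

    def on_heartbeat(self, sender_id: int) -> None:
        """Record heartbeats, but only from servers with a higher ID (bully rule)."""
        if sender_id > self.server_id:
            self.last_higher_heartbeat = time.monotonic()

    def is_leader(self) -> bool:
        """Leader iff no higher-ID server has been heard from within the last 2T."""
        return time.monotonic() - self.last_higher_heartbeat > LEADER_TIMEOUT


# s1 hears from s3 and stays a follower; once s3 has been silent for more than 2T,
# s1 (assuming it now has the highest live ID) starts acting as the single proposer.
s1 = Server(1)
s1.on_heartbeat(sender_id=3)
print(s1.is_leader())          # False: a higher-ID server is alive
time.sleep(2 * LEADER_TIMEOUT)
print(s1.is_leader())          # True: the higher-ID server has gone quiet
```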
Engineering Impl
This section summarizes the details of the RPCs between the proposer and the acceptors.
Server
- Each server maintains its own log, and we need to make sure the logs are consistent across all servers
- Leader election: only the leader acts as proposer to replicate log entries
- State
  - log
    - value
    - proposal
    - a marker for entries known to be chosen
  - each server maintains `firstUnchosenIndex`
- Messages
  - proposer -> acceptor
    - prepare msg
      - global proposal number
      - index
    - accept msg (handling sketched after this section)
      - `{proposal: p; index: i; value: v; firstUnchosenIndex: u_i}`
      - once a majority of acceptors have returned `noMoreAccepted`, only accept msgs need to be sent
      - the proposer keeps retrying until all acceptors respond
      - the acceptor puts the proposal and value at index `i` of its log
      - the acceptor marks all entries before `firstUnchosenIndex` with the same proposal number as chosen
  - acceptor -> proposer
    - response msg
      - `noMoreAccepted`
      - `firstUnchosenIndex`
      - the proposer compares its `firstUnchosenIndex` with the acceptor's `firstUnchosenIndex`
        - if the proposer's is larger, it sends a Success msg `{index: i; value: v}` to the acceptor to confirm the entry is chosen
          - the acceptor then updates `acceptedValue[i]` and `acceptedProposal[i]`
          - and returns another response containing its updated `firstUnchosenIndex`
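To tie the accept and response messages together, here is a rough Python sketch of the acceptor side; the names and structure are my own and not necessarily what the repo uses. The acceptor stores the value at index `i`, marks every earlier entry carrying the same proposal number as chosen, and reports its own `firstUnchosenIndex` so the leader knows whether to follow up with Success messages.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

CHOSEN = float("inf")   # entries marked with an "infinite" proposal number are chosen


@dataclass
class Entry:
    proposal: float = 0
    value: Optional[str] = None


@dataclass
class AcceptorState:
    log: Dict[int, Entry] = field(default_factory=dict)
    min_proposal: int = 0

    def first_unchosen_index(self) -> int:
        i = 1
        while i in self.log and self.log[i].proposal == CHOSEN:
            i += 1
        return i

    def handle_accept(self, proposal: int, index: int, value: str,
                      leader_first_unchosen: int):
        """Accept RPC: store the value, then mark older entries as chosen."""
        if proposal >= self.min_proposal:
            self.min_proposal = proposal
            self.log[index] = Entry(proposal, value)
        # Entries before the leader's firstUnchosenIndex that still carry the same
        # proposal number must already be decided, so mark them as chosen.
        for i, entry in self.log.items():
            if i < leader_first_unchosen and entry.proposal == proposal:
                entry.proposal = CHOSEN
        return self.min_proposal, self.first_unchosen_index()

    def handle_success(self, index: int, value: str) -> int:
        """Success RPC: the leader reports the chosen value for an entry we missed."""
        self.log[index] = Entry(CHOSEN, value)
        return self.first_unchosen_index()
```

In this sketch the proposer's side of the exchange is simply: retry `handle_accept` until every acceptor responds, and keep issuing `handle_success` calls while an acceptor's reported `firstUnchosenIndex` lags behind its own.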
Client
Any possible command, such as CRUD operations on a database.
gRPC Service
- Proposer
- Prepare Msg
- Proposal
- Accept Msg
- Proposal
- Index
- Value
- FirstUnchosenIndex
- Success Msg
- Index
- Value
- Prepare Msg
- Acceptor
- Response Msg
- FirstUnchosenIndex
- noMoreAccepted
- Response Msg
- Client
- Command like CRUD
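For reference, the messages above can be modeled as plain data records like the following Python dataclasses; this is an illustrative sketch, and the actual .proto definitions in the repo may differ in names and layout.

```python
from dataclasses import dataclass


@dataclass
class PrepareMsg:
    proposal: int                  # global proposal number covering the whole log


@dataclass
class AcceptMsg:
    proposal: int
    index: int
    value: str
    first_unchosen_index: int      # lets the acceptor mark older entries as chosen


@dataclass
class SuccessMsg:
    index: int
    value: str                     # the chosen value for an entry the acceptor is missing


@dataclass
class ResponseMsg:
    first_unchosen_index: int
    no_more_accepted: bool         # True if nothing is accepted beyond the current entry
```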
The code can be found [here](https://github.com/ydeng11/Multi-Paxos).