
How do replicas coming back online in Paxos or Raft catch up?

In consensus algorithms such as Paxos and Raft, a value is proposed, and if a quorum agrees, it is written durably to the data store. What happens to the participants that were unavailable at the time of the quorum? How do they eventually catch up? This seems to be left as an exercise for the reader wherever I look.

Take a look at the Raft protocol. It's simply built into the algorithm. The leader tracks, for each follower, the highest log index known to be replicated on that follower (matchIndex) and the index of the next entry to send to that follower (nextIndex), and it always sends entries to each follower starting at that follower's nextIndex. Because of this, there is no special case needed to catch up a follower that was missing when an entry was committed. When the follower restarts, the leader simply resumes sending entries from that follower's nextIndex, backing nextIndex off if the consistency check fails, until the logs match. Thus the node is caught up.
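Here is a minimal sketch of that leader-side catch-up loop, assuming a simplified in-memory log with a sentinel entry at index 0; the types and the appendEntriesRPC helper are hypothetical stand-ins, not taken from any particular Raft implementation:

```go
// Sketch of the leader-side catch-up loop: keep sending from nextIndex,
// backing off on rejection, until the follower's log matches.
package main

type Entry struct {
	Term    int
	Command string
}

type Leader struct {
	log        []Entry        // leader's log; index 0 is a sentinel entry
	nextIndex  map[string]int // next entry to send to each follower
	matchIndex map[string]int // highest entry known to be replicated on each follower
}

// replicateTo keeps sending entries starting at the follower's nextIndex
// until the follower accepts, at which point it is fully caught up.
func (l *Leader) replicateTo(follower string) {
	for {
		ni := l.nextIndex[follower]
		prevIndex := ni - 1
		prevTerm := l.log[prevIndex].Term
		entries := l.log[ni:] // everything the follower may be missing

		if appendEntriesRPC(follower, prevIndex, prevTerm, entries) {
			// Follower accepted: it now has everything up to the end of our log.
			l.matchIndex[follower] = len(l.log) - 1
			l.nextIndex[follower] = len(l.log)
			return
		}
		// Consistency check failed: back off and retry with an earlier prefix.
		l.nextIndex[follower] = ni - 1
	}
}

// appendEntriesRPC is a placeholder for the real AppendEntries RPC.
func appendEntriesRPC(follower string, prevIndex, prevTerm int, entries []Entry) bool {
	return true // stub
}

func main() {}
```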

With the original Paxos papers, it is indeed left as an exercise for the reader. In practice, with Paxos you can send additional messages, such as negative acknowledgements, to propagate more information around the cluster as a performance optimisation. These can be used to let a node know that it is behind due to lost messages. Once a node knows that it is behind, it can catch up using additional message types. That is described as Retransmission in the Trex multi-Paxos engine that I wrote to demystify Paxos.
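As a hedged illustration of that idea (not the actual Trex protocol; the message and field names here are hypothetical), a lagging acceptor can reply with a negative acknowledgement carrying its highest contiguous slot, and the leader can answer with the chosen values it is missing:

```go
// Sketch of negative-acknowledgement driven catch-up for multi-Paxos slots.
package main

type Accept struct {
	Slot  int // log index ("slot") this value is proposed for
	Value string
}

type Nack struct {
	From        string
	HighestSlot int // highest contiguous slot the sender has learned
}

type Retransmission struct {
	Values []Accept // chosen values the lagging node is missing
}

// onAccept runs on an acceptor: if it sees a slot far ahead of what it has
// learned, it replies with a Nack so the leader knows it is behind.
func onAccept(myID string, myHighestSlot int, msg Accept) *Nack {
	if msg.Slot > myHighestSlot+1 {
		return &Nack{From: myID, HighestSlot: myHighestSlot}
	}
	return nil
}

// onNack runs on the leader: it retransmits the chosen values above the
// lagging node's highest slot so that node can catch up.
func onNack(chosen []Accept, n Nack) Retransmission {
	var missing []Accept
	for _, a := range chosen {
		if a.Slot > n.HighestSlot {
			missing = append(missing, a)
		}
	}
	return Retransmission{Values: missing}
}

func main() {}
```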

The Google Chubby Paxos paper, Paxos Made Live, criticises Paxos for leaving a lot up to the people doing the implementation. Lamport trained as a mathematician and was attempting to mathematically prove that you couldn't have consensus over lossy networks when he found the solution. The original papers are very much supplying a proof that it is possible, rather than explaining how to build practical systems with it. Modern papers usually describe an application of some new technique backed up by experimental results; they also supply a formal proof, but IMHO most people skip over it and take it on trust. The unapproachable way that Paxos was introduced means that many people quote the original paper but fail to see that it describes leader election and multi-Paxos. Unfortunately, Paxos is still taught in a theoretical manner rather than in a modern style, which leads people to think that it is hard and to miss the essence of it.

I argue that Paxos is simple, but that reasoning about failures in a distributed system and testing to find any bugs is hard. Everything that is left to the reader in the original papers doesn't affect correctness, but it does affect latency, throughput and the complexity of the code. Once you understand what makes Paxos correct, which is mechanically simple, it becomes straightforward to write the rest of what is needed in a way that doesn't violate consistency when you optimise the code for your use case and workload.

For example, Corfu and CURP achieve blisteringly high performance: one uses Paxos only for metadata, the other only needs to run Paxos when there are concurrent writes to the same keys. Those solutions don't directly compete with Raft or multi-Paxos, as they solve specific high-performance scenarios (e.g. key-value stores). Yet they demonstrate that, for practical applications, there is a huge amount of optimisation you can do if your particular workload allows it, while still using Paxos for some part of the overall solution.

This is mentioned in Paxos Made Simple:

Because of message loss, a value could be chosen with no learner ever finding out. The learner could ask the acceptors what proposals they have accepted, but failure of an acceptor could make it impossible to know whether or not a majority had accepted a particular proposal. In that case, learners will find out what value is chosen only when a new proposal is chosen. If a learner needs to know whether a value has been chosen, it can have a proposer issue a proposal, using the algorithm described above.
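To make the last sentence of that quote concrete, here is a hedged sketch of why having a proposer issue a new proposal reveals an already-chosen value: the phase-1 rule forces the proposer to adopt the value of the highest-numbered accepted proposal reported by a quorum. The types and the selectValue helper are illustrative only, not from any library:

```go
// Sketch of the Paxos phase-1 value-selection rule that lets a learner
// discover a chosen value by triggering a fresh proposal round.
package main

import "fmt"

// Promise is an acceptor's phase-1 reply: the highest-numbered proposal it
// has already accepted, if any (AcceptedBallot == 0 means none).
type Promise struct {
	AcceptedBallot int
	AcceptedValue  string
}

// selectValue applies the phase-1 rule: propose the value of the
// highest-numbered accepted proposal among the quorum's promises,
// or the proposer's own value if nothing was accepted.
func selectValue(promises []Promise, ownValue string) string {
	highest := 0
	value := ownValue
	for _, p := range promises {
		if p.AcceptedBallot > highest {
			highest = p.AcceptedBallot
			value = p.AcceptedValue
		}
	}
	return value
}

func main() {
	// One acceptor in the quorum already accepted "X" at ballot 7, so the
	// proposer must re-propose "X" and the learner finds out the chosen value.
	quorum := []Promise{{0, ""}, {7, "X"}, {0, ""}}
	fmt.Println(selectValue(quorum, "Y")) // prints X
}
```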

And also in the Raft paper:

The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower.


If a follower's log is inconsistent with the leader's, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower's log and appends entries from the leader's log (if any).


If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully.
