
How do replicas coming back online in PAXOS or RAFT catch up?

Original question posted 2019-03-19 19:58:39. Tags: distributed-computing, consensus, paxos, raft

In consensus algorithms such as PAXOS and RAFT, a value is proposed, and if a quorum agrees, it is written durably to the data store. What happens to the participants that were unavailable when the quorum was reached? How do they eventually catch up? This seems to be left as an exercise for the reader wherever I look.

3 answers

Take a look at the Raft protocol. Catching up is simply built into the algorithm. For each follower, the leader tracks the highest index known to be replicated (matchIndex) and the next index to send (nextIndex), and it always sends entries to each follower starting at that follower's nextIndex. As a result, no special case is needed to catch up a follower that was unavailable when an entry was committed. When the follower restarts, the leader simply resumes sending entries to it, starting from the point its log had reached. Thus the node is caught up.
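A minimal sketch of that bookkeeping in Python, under simplifying assumptions: the names Leader, Follower and replicate are made up for illustration, indices are 0-based, and the term-based consistency check that real Raft performs is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    log: list = field(default_factory=list)
    online: bool = True

    def append_entries(self, start, entries):
        """Store entries beginning at index `start`.

        Returns the follower's last log index, or None if it is unreachable.
        """
        if not self.online:
            return None
        self.log[start:] = entries
        return len(self.log) - 1


@dataclass
class Leader:
    log: list = field(default_factory=list)
    next_index: dict = field(default_factory=dict)   # follower name -> next index to send
    match_index: dict = field(default_factory=dict)  # follower name -> highest replicated index

    def replicate(self, name, follower):
        """Send everything from the follower's nextIndex onwards."""
        start = self.next_index.get(name, 0)
        last = follower.append_entries(start, self.log[start:])
        if last is not None:  # only advance the bookkeeping on success
            self.match_index[name] = last
            self.next_index[name] = last + 1
```

If the follower is offline while new entries are appended, its nextIndex stays where it was, so the next successful call to replicate sends the whole backlog in one go. The same code path serves both normal replication and catch-up.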

With the original Paxos papers, it is indeed left as an exercise for the reader. In practice, with Paxos you can send additional messages, such as negative acknowledgements, to propagate more information around the cluster as a performance optimisation. These can let a node know that it has fallen behind because of lost messages. Once a node knows that it is behind, it needs to catch up, which can be done with additional message types. That is described as Retransmission in the Trex multi-paxos engine that I wrote to demystify Paxos.
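As a rough illustration of catch-up by retransmission, here is a Python sketch. It is not Trex's API: the Node class, its method names and the slot numbering (1-based) are all hypothetical, and only already-committed values are exchanged.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    committed: dict = field(default_factory=dict)  # slot number -> committed value

    def highest_contiguous_slot(self):
        """Highest slot with no gaps below it (0 if nothing is committed)."""
        slot = 0
        while (slot + 1) in self.committed:
            slot += 1
        return slot

    def on_retransmit_request(self, from_slot):
        """Peer side: reply with every committed value above from_slot, in slot order."""
        return sorted((s, v) for s, v in self.committed.items() if s > from_slot)

    def on_retransmit_response(self, entries):
        """Lagging side: fill in the missing slots."""
        for slot, value in entries:
            # A committed value never changes, so existing slots are left alone.
            self.committed.setdefault(slot, value)
```

A lagging node would send its highest_contiguous_slot() to a peer and apply the reply. Because the values exchanged have already been chosen, this extra message pair does not take part in the consensus rounds themselves.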

The Google Chubby Paxos paper, Paxos Made Live, criticises Paxos for leaving a lot up to the people doing the implementation. Lamport trained as a mathematician and was attempting to prove mathematically that you couldn't have consensus over lossy networks when he found the solution. The original papers are very much about supplying a proof that it is possible rather than explaining how to build practical systems with it. Modern papers usually describe an application of some new technique backed up by experimental results; they also supply a formal proof, but IMHO most people skip over it and take it on trust. The unapproachable way in which Paxos was introduced means that many people quote the original papers but fail to see that they describe leader election and Multi-Paxos. Unfortunately, Paxos is still taught in a theoretical manner rather than in a modern style, which leads people to think that it is hard and to miss its essence.

I argue that Paxos is simple, but that reasoning about failures in a distributed system and testing to find any bugs is hard. Everything that is left to the reader in the original papers doesn't affect correctness but does affect latency, throughput and the complexity of the code. Paxos is mechanically simple, so once you understand what makes it correct, it is straightforward to write the rest of what is needed in a way that doesn't violate consistency when you optimise the code for your use case and workload.

For example, Corfu and CURP give blisteringly high performance: one uses Paxos only for metadata, and the other only needs to run Paxos when there are concurrent writes to the same keys. Those solutions don't directly compete with Raft or Multi-Paxos, as they solve for specific high-performance scenarios (e.g., KV stores). Yet they demonstrate that, for practical applications, there is a huge number of optimisations you can make if your particular workload allows it, while still using Paxos for some part of the overall solution.

This is mentioned in Paxos Made Simple:

Because of message loss, a value could be chosen with no learner ever finding out. The learner could ask the acceptors what proposals they have accepted, but failure of an acceptor could make it impossible to know whether or not a majority had accepted a particular proposal. In that case, learners will find out what value is chosen only when a new proposal is chosen. If a learner needs to know whether a value has been chosen, it can have a proposer issue a proposal, using the algorithm described above.
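The last sentence of that quote can be illustrated with a small single-decree Paxos sketch in Python. This is a toy in-memory model under stated assumptions, not production code: the Acceptor class and the propose function are illustrative names, proposal numbers are plain integers assumed to be unique, and there is no networking or persistence.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0      # highest proposal number promised so far
        self.accepted = None   # (number, value) of the last accepted proposal

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(reachable, cluster_size, n, value):
    """Run both phases against the reachable acceptors.

    Returns the chosen value, or None if no majority responded.
    """
    majority = cluster_size // 2 + 1
    promises = [acc for ok, acc in (a.prepare(n) for a in reachable) if ok]
    if len(promises) < majority:
        return None
    prior = [p for p in promises if p is not None]
    if prior:
        # Must re-propose the value of the highest-numbered accepted proposal.
        value = max(prior, key=lambda p: p[0])[1]
    acks = sum(a.accept(n, value) for a in reachable)
    return value if acks >= majority else None
```

If "x" was chosen by acceptors 1 and 2 while acceptor 3 was down, a later proposal run against acceptors 2 and 3 with some other value still returns "x", and acceptor 3 ends up holding it. Issuing a proposal is therefore a way both to learn the chosen value and to bring a lagging acceptor up to date.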

And also in the Raft paper:

The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower.

If a follower's log is inconsistent with the leader's, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower's log and appends entries from the leader's log (if any).

If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully.
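The nextIndex back-off described in the quoted Raft passages can be sketched in Python as follows. The assumptions here are mine rather than the paper's: logs are in-memory lists of (term, command) tuples, indices are 0-based (the paper uses 1-based), a prev_index of -1 means "start of the log", and the RPC is a direct function call. The function names are illustrative.

```python
def append_entries(follower_log, prev_index, prev_term, entries):
    """Follower side of the consistency check.

    Rejects unless the follower holds an entry at prev_index with term prev_term.
    """
    if prev_index >= 0:
        if prev_index >= len(follower_log) or follower_log[prev_index][0] != prev_term:
            return False
    for offset, entry in enumerate(entries):
        i = prev_index + 1 + offset
        # A same-index entry with a different term conflicts: drop it and all that follow.
        if i < len(follower_log) and follower_log[i][0] != entry[0]:
            del follower_log[i:]
        if i >= len(follower_log):
            follower_log.append(entry)
    return True


def catch_up(leader_log, follower_log, next_index):
    """Leader side: decrement nextIndex after each rejection and retry.

    Returns the follower's new nextIndex once AppendEntries succeeds.
    """
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index][0] if prev_index >= 0 else 0
        if append_entries(follower_log, prev_index, prev_term, leader_log[next_index:]):
            return len(leader_log)
        next_index -= 1
```

The loop always terminates, because once nextIndex reaches 0 there is no previous entry left to check and the follower accepts the leader's whole log. A follower that was merely offline, with no conflicting entries, succeeds on the first or second attempt.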