
In Paxos, why can't we use random backoff to avoid collision?

I understand that the heart of the Paxos consensus algorithm is that any two majorities of a given set of nodes must overlap. Therefore, if a proposer's value is accepted by a majority, no other majority can accept a different value, given that any acceptor can accept only a single value.
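As a small sanity check of the overlap property, the sketch below enumerates every majority of a node set and confirms that each pair shares at least one node. The helper names `majorities` and `all_majorities_intersect` are made up for this illustration and are not part of any Paxos library.

```python
from itertools import combinations


def majorities(nodes):
    """Return every subset of `nodes` that contains more than half of them."""
    quorum = len(nodes) // 2 + 1
    return [
        set(combo)
        for size in range(quorum, len(nodes) + 1)
        for combo in combinations(nodes, size)
    ]


def all_majorities_intersect(nodes):
    """True if every pair of majorities shares at least one node."""
    return all(a & b for a, b in combinations(majorities(nodes), 2))


print(all_majorities_intersect(["A", "B", "C"]))            # True
print(all_majorities_intersect(["A", "B", "C", "D", "E"]))  # True
```

Because any two majorities share a node, and that shared node accepts only one value, two different values can never both be accepted by a majority.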

So the simplest "happy path" of a consensus algorithm is just for any proposer to ping a majority of acceptors and see if it can get them to accept its value, and if so, we're done.

The collision comes when concurrent proposers lead to a case where no majority of nodes agrees on a value. The simplest demonstration has 3 nodes: every node tries to get 2 nodes to accept its value, but due to concurrency every node ends up getting only itself to "accept" the value, and therefore no majority agrees on anything.
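The 3-node collision can be reproduced with a tiny simulation. This is only a sketch of the naive scheme described in the question, not of Paxos. The `Acceptor` class and the `value-from-X` strings are hypothetical, and "concurrency" is modelled by letting each node's own offer arrive first.

```python
class Acceptor:
    """Accepts the first value it is offered and never changes it."""

    def __init__(self):
        self.accepted = None

    def offer(self, value):
        # Take the value only if nothing has been accepted yet.
        if self.accepted is None:
            self.accepted = value
        return self.accepted == value


acceptors = {name: Acceptor() for name in ("A", "B", "C")}
majority = len(acceptors) // 2 + 1

# Concurrent start: every node's offer reaches its own acceptor first.
for name, acceptor in acceptors.items():
    acceptor.offer("value-from-" + name)

# Each proposer then asks all acceptors and counts how many agree with it.
results = {
    proposer: sum(a.offer("value-from-" + proposer) for a in acceptors.values())
    for proposer in acceptors
}

print(results)                                      # {'A': 1, 'B': 1, 'C': 1}
print(any(n >= majority for n in results.values())) # False
```

Note that in this sketch `offer` never overwrites an accepted value, so retrying against the same acceptors would produce the same tallies. What a retry can achieve depends on whether acceptors are allowed to change their accepted value.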

The Paxos algorithm goes on to introduce a 2-phase protocol to solve this problem.

But why can't we simply back off for a random amount of time and retry, until eventually one proposer succeeds in getting a majority to accept its value? It seems this would succeed eventually, since every proposer backs off for a random amount of time whenever it fails to get a majority.

I understand that this is not going to be ideal in terms of performance. But let's get performance out of the way first and only look at the correctness. Is there anything I'm missing here? Is this a correct (basic) consensus algorithm at all?

The designer of Paxos is a mathematician first, and he leaves the engineering to others.

As such, Paxos is designed for the general case to prove consensus is always safe, irrespective of any message delays or colliding back-offs.

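And now the sad part. The FLP impossibility result proves that, in an asynchronous system where even one process may fail, no deterministic consensus protocol that is always safe can also guarantee termination: it may loop forever without deciding.

Raft is also designed with this safety guarantee and thus has the same mathematical flaw.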

But, the author of Raft also made design choices to specialize Paxos so that an engineer could read the description and make a well-functioning system.

One of these design choices is the well-used trick of exponential random backoff to get around the FLP result in a practical way. This trick does not take away the mathematical possibility of an infinite loop, but does make its likelihood extremely, ridiculously, very small.
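A minimal sketch of exponential random backoff follows, under stated assumptions: `try_propose` is a hypothetical callable that returns `True` when a proposal round succeeds, and the `base`, `cap`, and `max_attempts` values are arbitrary illustrations, not numbers taken from Raft or from any Paxos implementation.

```python
import random
import time


def backoff_delay(attempt, base=0.05, cap=5.0, rng=random):
    """Pick a random delay in seconds; the upper bound doubles per attempt up to `cap`."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


def propose_with_backoff(try_propose, max_attempts=10, sleep=time.sleep, rng=random):
    """Call `try_propose` until it succeeds, sleeping a random, growing delay between tries.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        if try_propose():
            return attempt + 1
        sleep(backoff_delay(attempt, rng=rng))
    return None


# Usage with a stand-in proposal that fails twice and then succeeds.
outcomes = iter([False, False, True])
attempts = propose_with_backoff(lambda: next(outcomes), sleep=lambda seconds: None)
print(attempts)  # 3
```

The randomness makes it unlikely that two proposers keep retrying at the same moment, and the growing ceiling spreads them further apart after each collision. As the answer says, this lowers the probability of an endless run of collisions without removing it.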

You can tack on this trick to Paxos itself, and get the same benefit (and as a professional Paxos maintainer, believe me we do), but then it is not Pure Paxos.

Just to reiterate, the Paxos protocol was designed to be in its most basic form SO THAT mathematicians could prove broad statements about consensus. Any practical implementation details are left to the engineers.


Here is a case where a liveness issue in Raft caused a 6-hour outage: https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/


Note 1: Yes, I said that the Raft author specialized Paxos. Raft can be mapped onto the more general Vertical Paxos model, which in turn can be mapped onto the Paxos model. As can any system that implements consensus.


Note 2: I have worked with Lamport a few times. He is well aware of these engineering tricks, and he assumes everyone else is, too. Thus he focuses on the math of the problem in his papers, and not the engineering.

The logic you are describing is how leader election is implemented in Raft:

  • when there is no leader (or leader goes offline) every node will have a random delay
  • after the random delay, the node will contact every other node and propose "let me be the leader"
  • if the node gets the majority of votes, then the node considers itself the leader, which is equivalent to saying "the cluster reached consensus on who is the leader"
  • if the node did not get the majority, then after a timeout and a random delay, the node will attempt again
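
The steps above can be sketched as a toy simulation. This is not Raft itself: there is no log, no heartbeat, and no real messaging. The function `run_election`, its parameters, and the rule that a voter picks a random candidate (standing in for "votes for whichever request arrives first") are all assumptions made for illustration.

```python
import random


def run_election(nodes, rng, message_delay=1.0, timeout_range=(10.0, 20.0), max_terms=100):
    """Repeat randomized election rounds until one node wins a majority.

    Returns (term, leader), or None if no leader emerged within `max_terms`.
    """
    majority = len(nodes) // 2 + 1
    for term in range(1, max_terms + 1):
        # Every node waits a random delay before asking to be the leader.
        timeouts = {node: rng.uniform(*timeout_range) for node in nodes}
        earliest = min(timeouts.values())
        # A node becomes a candidate if its timer fires before the earliest
        # candidate's request could have reached it.
        candidates = [n for n in nodes if timeouts[n] - earliest < message_delay]
        votes = {c: 1 for c in candidates}  # each candidate votes for itself
        # Every other node casts exactly one vote in this term.
        for voter in nodes:
            if voter not in candidates:
                votes[rng.choice(candidates)] += 1
        winners = [c for c, count in votes.items() if count >= majority]
        if winners:
            return term, winners[0]
        # Split vote: nobody reached a majority, so try again in the next term.
    return None


print(run_election(["A", "B", "C"], random.Random(0)))
```

Since each node casts one vote per term, at most one candidate can reach a majority in a term. A split vote only happens when several timers fire within `message_delay` of each other, which the random delays make unlikely.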

Raft also has a concept of term, but at a high level, the randomized wait is the feature that helps reach consensus faster.

To answer your question "why can't we...": we can, but it will be a different protocol.
