
Real world example of Paxos

Can someone give me a real-world example of how Paxos algorithm is used in a distributed database? I have read many papers on Paxos that explain the algorithm but none of them really explain with an actual example.

A simple example could be a banking application where an account is being modified through multiple sessions (i.e. a deposit at a teller, a debit operation, etc.). Is Paxos used to decide which operation happens first? Also, what does one mean by multiple instances of the Paxos protocol? How and when is this used? Basically, I am trying to understand all this through a concrete example rather than abstract terms.

For example, consider a MapReduce-style system in which the master role is spread across three hosts: one acts as the master and the other two as slaves. The procedure for choosing the master uses the Paxos algorithm.

Google's Chubby lock service, which Bigtable relies on, also uses Paxos. See "The Chubby Lock Service for Loosely-Coupled Distributed Systems" and "Bigtable: A Distributed Storage System for Structured Data".

The Clustrix database is a distributed database that uses Paxos in the transaction manager. Paxos is used by the database internals to coordinate messages and maintain transaction atomicity in a distributed system.

  • The Coordinator is the node the transaction originated on.
  • Participants are the nodes that modified the database on behalf of the transaction.
  • Readers are nodes that executed code on behalf of the transaction but did not modify any state.
  • Acceptors are the nodes that log the state of the transaction.

The following steps are taken when performing a transaction commit:

  1. Coordinator sends a PREPARE message to each Participant.
  2. The Participants lock transaction state. They send PREPARED messages back to the Coordinator.
  3. Coordinator sends ACCEPT messages to Acceptors.
  4. The Acceptors log the membership id, transaction, commit id, and participants. They send ACCEPTED messages back to the Coordinator.
  5. Coordinator tells the user the commit succeeded.
  6. Coordinator sends COMMIT messages to each Participant and Reader.
  7. The Participants and Readers commit the transaction and update transaction state accordingly. They send COMMITTED messages back to the Coordinator.
  8. Coordinator removes internal state and is now done.
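The eight steps above can be sketched as a small in-memory simulation. This is only an illustration of the message order described in this answer, not Clustrix's actual implementation: the `Node` class, the `commit` function and the `COMMIT_OK` event are hypothetical names, and the rule that a majority of Acceptors must reply is an assumption (the steps above do not say how many ACCEPTED replies are required).

```python
from dataclasses import dataclass, field

REPLIES = {"PREPARE": "PREPARED", "ACCEPT": "ACCEPTED", "COMMIT": "COMMITTED"}


@dataclass
class Node:
    """Hypothetical cluster node that logs each message and answers it."""
    name: str
    log: list = field(default_factory=list)

    def receive(self, message: str) -> str:
        self.log.append(message)
        return REPLIES[message]


def commit(participants, readers, acceptors):
    """Play the Coordinator role and return the ordered trace of events."""
    trace = []
    # Steps 1-2: PREPARE goes to every Participant, which answers PREPARED.
    for node in participants:
        trace.append((node.name, node.receive("PREPARE")))
    # Steps 3-4: ACCEPT goes to the Acceptors, which log it and answer ACCEPTED.
    accepted = [a for a in acceptors if a.receive("ACCEPT") == "ACCEPTED"]
    trace.extend((a.name, "ACCEPTED") for a in accepted)
    # Assumption: a majority of Acceptors is enough to make the outcome durable.
    if len(accepted) <= len(acceptors) // 2:
        raise RuntimeError("not enough acceptors logged the transaction")
    # Step 5: the user is told about success before any COMMIT message is sent.
    trace.append(("user", "COMMIT_OK"))
    # Steps 6-7: COMMIT goes to Participants and Readers, which answer COMMITTED.
    for node in participants + readers:
        trace.append((node.name, node.receive("COMMIT")))
    # Step 8: the Coordinator drops its internal state.
    trace.append(("coordinator", "DONE"))
    return trace


participants = [Node("p1"), Node("p2")]
readers = [Node("r1")]
acceptors = [Node("a1"), Node("a2"), Node("a3")]
for event in commit(participants, readers, acceptors):
    print(event)
```

The point to notice in the trace is that the user hears about success as soon as the Acceptors have logged the transaction, before the Participants and Readers have been told to commit.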

This is all transparent to the application and is implemented in the database internals. So for your banking application, all the application would need to do is handle the exceptions raised for deadlocks and conflicts. The other key to implementing a database at scale is concurrency, which is generally helped by MVCC (multi-version concurrency control).

Can someone give me a real-world example of how Paxos algorithm is used in a distributed database?

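MySQL Group Replication is built on a Paxos-based protocol. This is why a highly available MySQL setup needs at least three servers: a majority (two of three) must still be reachable after any single failure. In contrast, a typical Postgres setup is a two-node master-slave configuration, which isn't running Paxos.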

I have read many papers on Paxos that explain the algorithm but none of them really explain with an actual example.

Here is a fairly detailed explanation of Paxos for transaction log replication. And here is the source code that implements it in Scala. Paxos (aka multi-Paxos) is optimally efficient in terms of messages: in a three-node cluster in steady state, the leader accepts its own next value, transmits it to the other two nodes, and knows the value is fixed as soon as it gets one response back, because the leader plus one other node form a majority. It can then put the commit message (the learning message) at the front of the next value that it sends.
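The steady-state counting can be sketched roughly as follows. This is a minimal illustration and not the Scala implementation mentioned above: the `Leader` class, its method names and the `commit_up_to` field are all hypothetical, and the sketch assumes slots are fixed in order so that the highest fixed slot can stand in for the commit point.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority of the cluster."""
    return cluster_size // 2 + 1


class Leader:
    """Hypothetical steady-state leader that only counts accept responses."""

    def __init__(self, cluster_size: int):
        self.cluster_size = cluster_size
        self.acks = {}      # slot -> ids of the nodes that accepted the value
        self.fixed = set()  # slots whose value is known to be fixed

    def accept_message(self, slot: int, value: str) -> dict:
        # The leader accepts its own value first, so it already has one vote.
        self.acks[slot] = {"leader"}
        # Piggyback the commit point on the next value (assumes in-order slots).
        commit_up_to = max(self.fixed, default=None)
        return {"slot": slot, "value": value, "commit_up_to": commit_up_to}

    def on_accepted(self, slot: int, node_id: str) -> bool:
        """Record one response; return True once the slot's value is fixed."""
        self.acks[slot].add(node_id)
        if len(self.acks[slot]) >= majority(self.cluster_size):
            self.fixed.add(slot)
        return slot in self.fixed


leader = Leader(cluster_size=3)
first = leader.accept_message(0, "deposit 50")
print(first["commit_up_to"])              # None: nothing is fixed yet
print(leader.on_accepted(0, "node-2"))    # True: leader + one node = 2 of 3
second = leader.accept_message(1, "debit 120")
print(second["commit_up_to"])             # 0: the commit rides on the next value
```

With five nodes the same code needs two responses before a slot is fixed, since `majority(5)` is 3.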

A simple example could be a banking application where an account is being modified through multiple sessions (i.e. a deposit at a teller, a debit operation, etc.). Is Paxos used to decide which operation happens first?

Yes. If you use a MySQL database cluster to hold the bank accounts, then Paxos is being used to ensure that the replicas agree with the master on the order in which transactions were applied to the customer bank accounts. If all the nodes agree on the order in which transactions were applied, they will all hold the same balances.

Operations on a bank account cannot be reordered freely: a different order can produce different balances and may violate the business rule of not exceeding your credit limit. The trivial way to ensure the order is to use a single server process that decides the official order simply based on the order of the messages that it receives. It can then track the balance of each bank account and enforce the business rules. Yet you don't want just a single server, as it may crash. You want replica servers that also receive the credit and debit commands and agree with the master.
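To make the ordering problem concrete, here is a small sketch with made-up numbers. The `apply_in_order` function and the credit limit of 100 are assumptions chosen for illustration only. The same two operations give different balances depending on which one is applied first, because the debit is rejected when it would exceed the credit limit.

```python
def apply_in_order(balance, credit_limit, operations):
    """Apply signed amounts in the given order.

    A debit that would push the balance below -credit_limit is rejected.
    Returns the final balance and the list of rejected operations.
    """
    rejected = []
    for amount in operations:
        if balance + amount < -credit_limit:
            rejected.append(amount)
        else:
            balance += amount
    return balance, rejected


# The same two operations, applied in the two possible orders.
deposit_first = apply_in_order(0, 100, [+50, -120])
debit_first = apply_in_order(0, 100, [-120, +50])

print(deposit_first)  # (-70, []): the debit fits within the credit limit
print(debit_first)    # (50, [-120]): the debit is rejected, then the deposit lands
```

Two replicas that saw these messages in different orders would therefore disagree about both the balance and whether the debit succeeded.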

The challenge with having replicas that should hold the same balances is that messages may be lost and resent, and messages are buffered by switches that may deliver some of them late. The net effect is that, if the network is unstable, it is hard to prove that a fast replication protocol will never cause different servers to see the messages arrive in different orders. If that happens, you will end up with different servers in the same cluster holding different balances.

You don't have to use Paxos to solve the bank accounts problem. You can just do simple master-slave replication. You have one master, one or more slaves, and the master waits until it has got acknowledgements from the slaves before telling any client the outcome of a command. The challenge there is lost and reordered messages. Before Paxos was invented database vendors just created expensive hardware designed to have very high redundancy and reliability to run master-slave. What was revolutionary about Paxos is that it does work with commodity networking and without specialist hardware.

Since banking applications were profitable with expensive custom hardware it is likely that many real-world banking systems are still running that way. In such scenarios, the database vendor supplies the specialist hardware with built-in reliable networking that the database software runs on. That is very expensive and not something that smaller companies want. Cost-conscious companies can set up a MySQL cluster on VMs in any public cloud with normal networking and Paxos will make it reliable rather than using specialist hardware.

Also, what does one mean by multiple instances of the Paxos protocol? How and when is this used?

I wrote a blog about multi-Paxos being the original Paxos protocol. Simply put, in the case of choosing the order of transactions in a cluster, you want to stream the transactions as a stream of values. Each value is fixed in a separate logical instance of the protocol. As described in my blog about Paxos for cluster replication, the algorithm is very efficient in steady state, needing only one round trip between the master and enough nodes to form a majority, which is one other node in a three-node cluster. When there are crashes or network issues the algorithm is always safe but needs more messages. So, to answer your question, typical applications need multiple rounds of Paxos to establish the order of client commands in the cluster.
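One way to picture "a separate logical instance per value" is a log in which every slot runs its own single-value Paxos. The sketch below is a simplified, synchronous, in-memory illustration and not code from the blog posts: the `Acceptor`, `run_instance` and `ReplicatedLog` names are hypothetical, there is no networking or failure handling, and the leader optimisation described above is left out, so every call runs both phases.

```python
class Acceptor:
    """Acceptor state for ONE instance (one log slot)."""

    def __init__(self):
        self.promised = -1     # highest ballot number promised so far
        self.accepted = None   # (ballot, value) last accepted, if any

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def run_instance(acceptors, ballot, value):
    """Run both phases for one slot; return the chosen value or None."""
    quorum = len(acceptors) // 2 + 1
    replies = [a.prepare(ballot) for a in acceptors]
    promises = [previous for ok, previous in replies if ok]
    if len(promises) < quorum:
        return None
    # If any acceptor already accepted a value, the proposer must adopt the
    # one with the highest ballot instead of pushing its own value.
    earlier = [p for p in promises if p is not None]
    if earlier:
        value = max(earlier, key=lambda p: p[0])[1]
    votes = sum(a.accept(ballot, value) for a in acceptors)
    return value if votes >= quorum else None


class ReplicatedLog:
    """One independent Paxos instance per log slot."""

    def __init__(self, cluster_size=3):
        self.cluster_size = cluster_size
        self.instances = {}   # slot -> list of Acceptor objects

    def decide(self, slot, ballot, value):
        acceptors = self.instances.setdefault(
            slot, [Acceptor() for _ in range(self.cluster_size)]
        )
        return run_instance(acceptors, ballot, value)


log = ReplicatedLog()
print(log.decide(0, ballot=1, value="deposit 50"))   # deposit 50
print(log.decide(1, ballot=1, value="debit 120"))    # debit 120
print(log.decide(0, ballot=2, value="debit 999"))    # deposit 50 (slot 0 is already fixed)
```

The last call shows why the order is stable: once slot 0 holds a value, a later proposer with a higher ballot can only rediscover that value, so its own command has to go into another slot.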

I should note that Raft was specifically invented as a detailed description of how to perform cluster replication. The original Paxos papers require you to figure out many of the details to do cluster replication. So we can expect that people who are specifically trying to implement cluster replication would use Raft as it leaves nothing for the implementor to have to figure out for themselves.

So when might you use Paxos? It can be used to change the cluster membership of a cluster that is writing values based on a different protocol that can only be correct when you know the exact cluster membership. Corfu is a great example of that where it removes the bottleneck of writing via a single master by having clients write to shards of servers concurrently. Yet it can only do that accurately when all clients have an accurate view of the current cluster membership and shard layout. When nodes crash or you need to expand the cluster you propose a new cluster membership and shard layout and run it through Paxos to get consensus across the cluster.

Raft is a consensus algorithm that is designed to be easy to understand. It's equivalent to Paxos in fault tolerance and performance. The difference is that it's decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems. Raft was meant to be more understandable than Paxos by means of separation of logic, but it is also formally proven safe and offers some additional features. Raft offers a generic way to distribute a state machine across a cluster of computing systems, ensuring that each node in the cluster agrees upon the same series of state transitions.
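The "same series of state transitions" idea can be shown without any of the consensus machinery. The sketch below assumes the log has already been agreed on (which is the part Raft or Paxos actually provides) and only demonstrates that nodes applying the same log in the same order end up in the same state. The `KeyValueStateMachine` class and its command format are hypothetical.

```python
class KeyValueStateMachine:
    """Deterministic state machine that applies an agreed log in order."""

    def __init__(self):
        self.data = {}
        self.last_applied = 0   # number of log entries applied so far

    def catch_up(self, log):
        """Apply every entry in the log that has not been applied yet."""
        for op, key, value in log[self.last_applied:]:
            if op == "set":
                self.data[key] = value
            elif op == "delete":
                self.data.pop(key, None)
            self.last_applied += 1


# An already-agreed log of commands (the output of the consensus layer).
agreed_log = [
    ("set", "alice", 50),
    ("set", "bob", 20),
    ("set", "alice", 70),
    ("delete", "bob", None),
]

fast_node = KeyValueStateMachine()
slow_node = KeyValueStateMachine()

fast_node.catch_up(agreed_log)
slow_node.catch_up(agreed_log[:2])   # the slow node lags behind at first
slow_node.catch_up(agreed_log)       # and later applies the rest, in order

print(fast_node.data)   # {'alice': 70}
print(slow_node.data)   # {'alice': 70}
```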

Practical Uses of Raft

etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system or cluster of machines. It gracefully handles leader elections during network partitions and can tolerate machine failure, even in the leader node. Applications of any complexity, from a simple web app to Kubernetes, can read data from and write data into etcd.

etcd is written in Go, which has excellent cross-platform support, small binaries and a great community behind it. Communication between etcd machines is handled via the Raft consensus algorithm.
