
Real world example of Paxos

Original post: 2012-05-08 19:08:32 9 4 algorithm/ distributed/ paxos/ consensus

Can someone give me a real-world example of how the Paxos algorithm is used in a distributed database? I have read many papers on Paxos that explain the algorithm, but none of them really explains it with an actual example.

A simple example could be a banking application where an account is being modified through multiple sessions (i.e. a deposit at a teller, a debit operation, etc.). Is Paxos used to decide which operation happens first? Also, what does one mean by multiple instances of the Paxos protocol? How and when is this used? Basically, I am trying to understand all this through a concrete example rather than abstract terms.

4 Answers

For example, consider a MapReduce system where the master role is backed by 3 hosts. One host is the master and the others are slaves. The procedure for choosing the master uses the Paxos algorithm.

Also, Chubby, the lock service used by Google's Bigtable, uses Paxos. See the papers "The Chubby Lock Service for Loosely-Coupled Distributed Systems" and "Bigtable: A Distributed Storage System for Structured Data".

The Clustrix database is a distributed database that uses Paxos in the transaction manager. Paxos is used by the database internals to coordinate messages and maintain transaction atomicity in a distributed system.

The Coordinator is the node the transaction originated on.
Participants are the nodes that modified the database on behalf of the transaction.
Readers are nodes that executed code on behalf of the transaction but did not modify any state.
Acceptors are the nodes that log the state of the transaction.

The following steps are taken when performing a transaction commit:

Coordinator sends a PREPARE message to each Participant.
The Participants lock transaction state. They send PREPARED messages back to the Coordinator.
Coordinator sends ACCEPT messages to Acceptors.
The Acceptors log the membership id, transaction, commit id, and participants. They send ACCEPTED messages back to the Coordinator.
Coordinator tells the user the commit succeeded.
Coordinator sends COMMIT messages to each Participant and Reader.
The Participants and Readers commit the transaction and update transaction state accordingly. They send COMMITTED messages back to the Coordinator.
Coordinator removes internal state and is now done.
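
The order of messages in the steps above can be traced with a small in-memory simulation. This is only a sketch of the sequence as listed in this answer: the names `Node` and `commit_transaction` are hypothetical and are not Clustrix internals, and treating a majority of Acceptors as sufficient is an assumption borrowed from Paxos in general.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical cluster node that records every message it receives."""
    name: str
    inbox: list = field(default_factory=list)

    def receive(self, message: str, reply: str) -> str:
        # Record the incoming message and answer with the matching reply.
        self.inbox.append(message)
        return reply


def commit_transaction(participants, readers, acceptors):
    """Walk the commit steps in order and return the coordinator's trace."""
    trace = []

    # PREPARE goes to every Participant, which locks state and replies.
    replies = [p.receive("PREPARE", "PREPARED") for p in participants]
    if not all(r == "PREPARED" for r in replies):
        return trace + ["ABORT"]
    trace.append("PREPARED")

    # ACCEPT goes to the Acceptors, which log the decision.
    acks = [a.receive("ACCEPT", "ACCEPTED") for a in acceptors]
    # Assumption: a majority of Acceptors is enough, as in Paxos generally.
    if acks.count("ACCEPTED") <= len(acceptors) // 2:
        return trace + ["ABORT"]
    trace.append("ACCEPTED")

    # The outcome is now logged, so the user can be told before COMMIT is sent.
    trace.append("USER_NOTIFIED")

    # COMMIT goes to Participants and Readers.
    for node in participants + readers:
        node.receive("COMMIT", "COMMITTED")
    trace.append("COMMITTED")

    # The coordinator drops its internal state for this transaction.
    trace.append("DONE")
    return trace


participants = [Node("p1"), Node("p2")]
readers = [Node("r1")]
acceptors = [Node("a1"), Node("a2"), Node("a3")]
trace = commit_transaction(participants, readers, acceptors)
print(trace)
```

Note that the user is notified after the Acceptors have logged the transaction but before any Participant commits, which matches the ordering given in the list.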

This is all transparent to the application and is implemented in the database internals. So for your banking application, all the application level would need to do is perform exception handling for deadlock conflicts. The other key to implementing a database at scale is concurrency, which is generally helped by MVCC (multi-version concurrency control).

Can someone give me a real-world example of how the Paxos algorithm is used in a distributed database?

MySQL uses Paxos. This is why a highly available MySQL setup needs three servers. In contrast, a typical Postgres setup is a two-node master-slave configuration, which does not run Paxos.

I have read many papers on Paxos that explain the algorithm, but none of them really explains it with an actual example.

Here is a fairly detailed explanation of Paxos for transaction log replication. And here is the source code that implements it in Scala. Paxos (aka multi-Paxos) is optimally efficient in terms of messages: in a three-node cluster in steady state, the leader accepts its own next value, transmits it to the other two nodes, and knows the value is fixed as soon as it gets back one response, because its own acceptance plus that one response form a majority of two out of three. It can then put the commit message (the learning message) at the front of the next value that it sends.
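
The "one response is enough" claim for a three-node cluster is plain majority arithmetic. A minimal sketch, with hypothetical function names, assuming the leader counts its own acceptance toward the majority:

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def responses_needed(cluster_size: int) -> int:
    """Replies the leader must wait for, given that it accepts its own value."""
    return majority(cluster_size) - 1


for n in (3, 5, 7):
    print(n, majority(n), responses_needed(n))
```

For three nodes the majority is two, so the leader needs a reply from just one of the other two nodes.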

A simple example could be a banking application where an account is being modified through multiple sessions (i.e. a deposit at a teller, a debit operation, etc.). Is Paxos used to decide which operation happens first?

Yes. If you use a MySQL database cluster to hold the bank accounts, then Paxos is used to ensure that the replicas agree with the master on the order in which transactions were applied to the customer bank accounts. If all the nodes agree on the order in which transactions were applied, they will all hold the same balances.

Operations on a bank account cannot be reordered without producing different balances, which may violate the business rule of not exceeding your credit limit. The trivial way to ensure the order is to use a single server process that decides the official order simply from the order in which it receives messages. It can then track the balance of each bank account and enforce the business rules. Yet you don't want just a single server, as it may crash. You want replica servers that also receive the credit and debit commands and agree with the master.
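
To see why the order matters, here is a small sketch that applies the same two commands in both orders. The function `apply_in_order` and the rule "reject any debit that would exceed the credit limit" are illustrative assumptions, not taken from any particular banking system.

```python
def apply_in_order(balance, credit_limit, operations):
    """Apply signed amounts in order; reject any that would breach the limit."""
    rejected = []
    for amount in operations:
        if balance + amount < -credit_limit:
            # Business rule: the account may never go below -credit_limit.
            rejected.append(amount)
        else:
            balance += amount
    return balance, rejected


deposit, debit = 100, -80

# The same two commands, applied in two different orders.
print(apply_in_order(0, 0, [deposit, debit]))
print(apply_in_order(0, 0, [debit, deposit]))
```

With the deposit first, both commands succeed and the balance is 20. With the debit first, it is rejected and the balance ends at 100. Two replicas that saw the commands in different orders would therefore disagree.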

The challenge with having replicas that should hold the same balances is that messages may be lost and resent, and messages are buffered by switches that may deliver some of them late. The net effect is that on an unstable network it is hard to prove that a fast replication protocol will never cause different servers to see the messages arrive in different orders. If that happens, you end up with different servers in the same cluster holding different balances.

You don't have to use Paxos to solve the bank accounts problem. You can just do simple master-slave replication. You have one master and one or more slaves, and the master waits until it has received acknowledgements from the slaves before telling any client the outcome of a command. The challenge there is lost and reordered messages. Before Paxos was invented, database vendors simply built expensive hardware designed for very high redundancy and reliability to run master-slave replication. What was revolutionary about Paxos is that it works with commodity networking and without specialist hardware.

Since banking applications were profitable with expensive custom hardware, it is likely that many real-world banking systems are still running that way. In such scenarios, the database vendor supplies the specialist hardware, with built-in reliable networking, that the database software runs on. That is very expensive and not something that smaller companies want. Cost-conscious companies can set up a MySQL cluster on VMs in any public cloud with normal networking, and Paxos will make it reliable without specialist hardware.

Also, what does one mean by multiple instances of the Paxos protocol? How and when is this used?

I wrote a blog post about multi-Paxos being the original Paxos protocol. Simply put, when choosing the order of transactions in a cluster, you want to stream the transactions as a stream of values. Each value is fixed in a separate logical instance of the protocol. As described in my blog post about Paxos for cluster replication, the algorithm is very efficient in steady state: it needs only one round trip between the master and enough nodes to form a majority, which is one other node in a three-node cluster. When there are crashes or network issues, the algorithm is always safe but needs more messages. So to answer your question, typical applications need multiple rounds of Paxos to establish the order of client commands in the cluster.
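
The idea of "one logical instance per value" can be sketched as a replicated log in which each slot is decided by its own run of single-decree Paxos. This is a simplified in-memory illustration, not the code from the linked blog: the names `Acceptor` and `propose` are hypothetical, messages are plain method calls that never get lost, and both phases run for every slot, so the steady-state optimisation of skipping the prepare phase is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Acceptor:
    """Holds independent Paxos state for every log slot (instance)."""
    promised: dict = field(default_factory=dict)   # slot -> highest ballot promised
    accepted: dict = field(default_factory=dict)   # slot -> (ballot, value)

    def prepare(self, slot, ballot):
        # Promise only if this ballot is newer than any promised for the slot.
        if ballot > self.promised.get(slot, -1):
            self.promised[slot] = ballot
            return True, self.accepted.get(slot)
        return False, None

    def accept(self, slot, ballot, value):
        # Accept unless a higher ballot has been promised for the slot.
        if ballot >= self.promised.get(slot, -1):
            self.promised[slot] = ballot
            self.accepted[slot] = (ballot, value)
            return True
        return False


def propose(acceptors, slot, ballot, value):
    """Run one instance of Paxos for `slot`; return the chosen value or None."""
    quorum = len(acceptors) // 2 + 1

    # Phase 1: collect promises and learn any value already accepted.
    replies = [a.prepare(slot, ballot) for a in acceptors]
    promises = [prior for ok, prior in replies if ok]
    if len(promises) < quorum:
        return None
    prior = [p for p in promises if p is not None]
    if prior:
        # Safety rule: re-propose the value with the highest accepted ballot.
        value = max(prior, key=lambda bv: bv[0])[1]

    # Phase 2: ask the acceptors to accept the value.
    acks = sum(a.accept(slot, ballot, value) for a in acceptors)
    return value if acks >= quorum else None


acceptors = [Acceptor() for _ in range(3)]
commands = ["deposit 100", "debit 80", "deposit 5"]

# One Paxos instance per log slot fixes the order of the commands.
log = [propose(acceptors, slot, ballot=1, value=cmd)
       for slot, cmd in enumerate(commands)]
print(log)

# A later proposer with a higher ballot cannot change what slot 0 already chose.
print(propose(acceptors, 0, ballot=2, value="debit 999"))
```

The second call returns the value already chosen for slot 0 rather than the competing one, which is the property that keeps every replica's log identical.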

I should note that Raft was specifically invented as a detailed description of how to perform cluster replication. The original Paxos papers require you to figure out many of the details of cluster replication yourself. So we can expect that people who are specifically trying to implement cluster replication would use Raft, as it leaves nothing for the implementor to figure out for themselves.

So when might you use Paxos? It can be used to change the membership of a cluster that writes values using a different protocol, one that is only correct when the exact cluster membership is known. Corfu is a great example of that: it removes the bottleneck of writing via a single master by having clients write to shards of servers concurrently. Yet it can only do that accurately when all clients have an accurate view of the current cluster membership and shard layout. When nodes crash or you need to expand the cluster, you propose a new cluster membership and shard layout and run it through Paxos to get consensus across the cluster.
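
One way to picture why clients need an accurate view of the layout is to stamp each layout with an epoch number and have servers refuse writes from clients holding an older epoch. The sketch below assumes that mechanism purely for illustration: `Layout`, `StorageServer`, and the round-robin placement are hypothetical, and the Paxos round that agrees on the new layout is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """A hypothetical cluster layout: an epoch number plus the shard list."""
    epoch: int
    shards: tuple

    def shard_for(self, position: int) -> str:
        # Round-robin placement of log positions across shards (an assumption).
        return self.shards[position % len(self.shards)]


class StorageServer:
    def __init__(self, layout: Layout):
        self.layout = layout
        self.cells = {}

    def install(self, layout: Layout):
        # Called once the new layout has been agreed through consensus.
        if layout.epoch > self.layout.epoch:
            self.layout = layout

    def write(self, client_epoch: int, position: int, value: str) -> bool:
        # Refuse writes from clients whose view of the membership is stale.
        if client_epoch != self.layout.epoch:
            return False
        self.cells[position] = value
        return True


old = Layout(epoch=1, shards=("s1", "s2"))
new = Layout(epoch=2, shards=("s1", "s2", "s3"))

server = StorageServer(old)
print(server.write(old.epoch, 0, "a"))   # client and server agree on the layout
server.install(new)                      # layout change decided via Paxos (not shown)
print(server.write(old.epoch, 1, "b"))   # stale client is rejected
print(server.write(new.epoch, 1, "b"))   # refreshed client succeeds
print(old.shard_for(2), new.shard_for(2))
```

The last line shows the same log position mapping to different shards under the two layouts, which is why a client writing with a stale layout could put data in the wrong place.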

Raft is a consensus algorithm that is designed to be easy to understand. It's equivalent to Paxos in fault-tolerance and performance. The difference is that it's decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems. Raft was meant to be more understandable than Paxos by means of separation of logic, but it is also formally proven safe and offers some additional features. Raft offers a generic way to distribute a state machine across a cluster of computing systems, ensuring that each node in the cluster agrees upon the same series of state transitions.

Practical Uses of Raft

etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system or cluster of machines. It gracefully handles leader elections during network partitions and can tolerate machine failure, even in the leader node. Applications of any complexity, from a simple web app to Kubernetes, can read data from and write data into etcd.

etcd is written in Go, which has excellent cross-platform support, small binaries and a great community behind it. Communication between etcd machines is handled via the Raft consensus algorithm.