
Difference between Sharding and Replication on MongoDB

Original post: 2013-11-01 02:45:08 | Tags: mongodb, replication, sharding

I am confused about how sharding and replication work. According to the definitions:

Replication: A replica set in MongoDB is a group of mongod processes that maintain the same data set.

Sharding: Sharding is a method for storing data across multiple machines.

As I understand it, if there is 75GB of data, then with replication (3 servers) each server stores the full 75GB: 75GB on server-1, 75GB on server-2, and 75GB on server-3 (correct me if I am wrong). With sharding, it would be stored as 25GB on server-1, 25GB on server-2, and 25GB on server-3 (right?). But then I encountered this line in the tutorial:

Shards store the data. To provide high availability and data consistency, in a production sharded cluster, each shard is a replica set.

If a replica set holds 75GB but a shard holds 25GB, how can they be equivalent? This confuses me a lot. I think I am missing something important here. Please help me with this.

5 Answers

Let's try this analogy. You are running a library.

Like anyone running a library, you have books. You store all of your books on a shelf. This is good, but your library became so good that your rival wants to burn it. So you decide to set up additional shelves in other places. One shelf is the most important, and whenever you add new books to it you quickly add the same books to the other shelves. Now if the rival destroys a shelf, it is not a problem: you just open another one and copy the books onto it.

This is replication (substitute application for library, server for shelf, document in a collection for book, and a failed HDD on the server for your rival). It simply makes additional copies of the data, and if something goes wrong it automatically selects another primary.
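The shelf analogy can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class `ToyReplicaSet` and its methods are hypothetical names, copying is done immediately rather than asynchronously, and the "election" simply picks the first surviving member, which is not how MongoDB's real election protocol works.

```python
class ToyReplicaSet:
    """Toy model: one primary plus secondaries that each hold a full copy."""

    def __init__(self, member_names):
        # Every member stores the complete data set.
        self.members = {name: {} for name in member_names}
        self.primary = member_names[0]

    def write(self, key, document):
        # A write is applied on every member (primary and secondaries).
        # Real replication is asynchronous; here it happens immediately.
        for data in self.members.values():
            data[key] = dict(document)

    def fail(self, name):
        # Remove a failed member; if it was the primary, promote a survivor.
        del self.members[name]
        if name == self.primary:
            self.primary = next(iter(self.members))

    def read(self, key):
        # Reads are served by the current primary.
        return self.members[self.primary][key]


rs = ToyReplicaSet(["server-1", "server-2", "server-3"])
rs.write("book-1", {"title": "Dune", "author": "Herbert"})
rs.fail("server-1")        # the primary's disk dies
print(rs.primary)          # a surviving member has taken over
print(rs.read("book-1"))   # the document is still available
```

The point of the sketch is only that every member holds the same documents, so losing one member loses no data.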

This concept may help if you want to:

- scale reads (but secondaries might lag behind the primary)
- do some offline reads that do not touch the main server
- serve some part of the data for a specific region from a server in that region

But the main reason behind replication is data availability. So here you are right: if you have 75GB of data and replicate it with 2 secondaries, you will store 75 * 3 = 225GB of data.

Look at another scenario. There is no rival, so you do not need copies of your shelves. But now you have a different problem: your library became so good that one shelf is not enough. You decide to distribute your books across many shelves, based on the author's name (this is not a good choice of key; read up on how to select a shard key). Every book whose author name sorts before K goes to one shelf, and everything from K onward goes to another. This is sharding.
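The "before K / K and after" rule above is range-based partitioning, and can be sketched as follows. The shelf names, the single split point, and the function `shard_for` are made up for illustration; a real cluster manages its own chunk ranges.

```python
import bisect

# Hypothetical split point: names below "K" go to shelf 0, the rest to shelf 1.
SPLIT_POINTS = ["K"]
SHARDS = ["shelf-A-to-J", "shelf-K-to-Z"]


def shard_for(author):
    """Return the shelf (shard) responsible for this author name."""
    key = author.upper()
    # bisect_right counts how many split points are <= key.
    return SHARDS[bisect.bisect_right(SPLIT_POINTS, key)]


for author in ["Asimov", "Herbert", "King", "Tolkien"]:
    print(author, "->", shard_for(author))
```

Adding more shelves in this sketch means adding more split points. It also hints at why author name is a weak key: nothing guarantees that the ranges hold similar numbers of books.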

This concept may help you:

- distribute a workload
- store much more data than can fit on a single server
- run map-reduce jobs
- keep more data in RAM for faster queries

Here you are partially correct. If you have 75GB, the servers will still hold 75GB in total, but it will not necessarily be divided equally among them.

But there is a problem with sharding alone. Now your rival appears, goes to one of your shelves, and burns it. All the data on that shelf is lost. So you want to replicate every shard as well. Basically, the notion that

each shard is a replica set

is not literally true as a definition: sharding and replication are separate mechanisms. But if you shard, you should set up replication for every shard (newer MongoDB versions require each shard to be deployed as a replica set), because the more shards you have, the higher the probability that at least one of them will fail.
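The last claim is simple arithmetic. A minimal sketch, assuming servers fail independently and using a made-up 1% failure chance per unreplicated shard (the function name `p_any_shard_lost` is hypothetical):

```python
def p_any_shard_lost(num_shards, p_fail):
    """Probability that at least one of num_shards independent servers fails."""
    return 1 - (1 - p_fail) ** num_shards


# Assumed figure: each unreplicated shard has a 1% chance of failing.
for n in (1, 3, 10, 50):
    print(n, "shards ->", round(p_any_shard_lost(n, 0.01), 4))
```

The probability grows with every shard added, which is why an unreplicated sharded cluster gets riskier as it scales out.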

Answering Saad's follow-up answer:

Although you can have shards and replicas together on the same server, this is not the recommended way of doing it. Each server should have a single role in the system. If, for example, you decide to have 2 shards and keep 3 copies of each, you will end up with 6 machines.
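Applied to the 75GB example from the question, the one-role-per-machine layout works out as below. This is a sketch that assumes the data splits evenly across shards, which a real shard key does not guarantee; `dedicated_layout` and the machine names are hypothetical.

```python
def dedicated_layout(total_gb, num_shards, copies_per_shard):
    """One machine per replica-set member; assumes data splits evenly."""
    per_shard_gb = total_gb / num_shards
    return [
        (f"shard{s + 1}-member{m + 1}", per_shard_gb)
        for s in range(num_shards)
        for m in range(copies_per_shard)
    ]


machines = dedicated_layout(total_gb=75, num_shards=3, copies_per_shard=3)
print(len(machines), "machines")                  # 3 shards x 3 copies each
print(machines[0][1], "GB on each machine")
print(sum(gb for _, gb in machines), "GB stored in total")
```

So "each shard is a replica set" means each 25GB shard is itself stored three times. A replica set here holds 25GB per member, not 75GB.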

I know this might sound too costly, but remember that this is commodity hardware. If the service you provide is already so successful that you need high availability and your data no longer fits on one machine, then this is a rather cheap price to pay (compared with one big dedicated machine).

I am writing this as an answer, but it is actually a question about @Salvador's answer.

As you said, with sharding the 75GB of data "may be" stored as 25GB on server-1, 25GB on server-2, and 25GB on server-3 (this distribution depends on the shard key). Then, to prevent data loss, we also need to replicate each shard. So this means every server now contains its own shard and also replicas of the shards on the other servers. That is, server-1 will have:

1) Its own shard.

2) A replica of the shard on server-2.

3) A replica of the shard on server-3.

The same goes for server-2 and server-3. Am I right? If this is the case, then each server again holds 75GB of data. Right or wrong?

Since we want to create 3 shards and also replicate the data, the following is the solution to the above problem.

If a server holds both a shard and a replica set, then the failure of that server will lead to the loss of both the replica set and the shard.

However, you can have shard 1 and the replicas of shard 2 and shard 3 on the same server, but this is not advisable.
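The co-located layout asked about above can be written out to check the arithmetic. This is a sketch under the assumption that the 75GB splits evenly into three 25GB shards; the server names, the `layout` mapping, and `copies_left` are illustrative only.

```python
SERVERS = ["server-1", "server-2", "server-3"]
SHARD_GB = 25  # assumes the 75GB splits evenly across 3 shards

# Each server is primary for its own shard and holds a copy of the other two.
layout = {
    server: {
        f"shard-{i + 1}": ("primary" if i == idx else "secondary")
        for i in range(len(SERVERS))
    }
    for idx, server in enumerate(SERVERS)
}

for server, shards in layout.items():
    print(server, shards, "->", SHARD_GB * len(shards), "GB")


def copies_left(failed_server):
    """Count surviving copies of each shard after one server fails."""
    survivors = [s for s in SERVERS if s != failed_server]
    return {
        shard: sum(shard in layout[s] for s in survivors)
        for shard in layout[failed_server]
    }


print(copies_left("server-1"))   # every shard loses one copy at once
```

Under these assumptions each server does hold 75GB again. Storage is therefore no longer spread out, and a single machine failure degrades all three shards at the same time, which is one reason the layout is discouraged.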

Sharding is like partitioning data. Let's say you have around 3GB of data and you define 3 shards, so each shard MIGHT take 1GB of data (it really depends on the shard key). Why is sharding needed? Scanning 3GB for a specific item is, roughly, three times the work of scanning 1GB. So it is very similar to partitioning, and sharding helps with fast access to data.

Now, coming to replicas: let's say you have the same 3GB of data without any replication (meaning only a single copy of the data exists), so if anything happens to that machine or drive, your data is gone. Replication comes into the picture to solve this problem. Let's say that when you set up the DB you chose a replication factor of 3, which means the same 3GB of data is stored 3 times (so the total size is 9GB: three copies of 3GB each). Replication helps with failover.