
MongoDB on EC2 server or AWS SimpleDB?

Original post 2010-08-02 20:43:45 9 3 mongodb/ amazon-simpledb

Which scenario makes more sense: hosting several EC2 instances with MongoDB installed, or using the Amazon SimpleDB web service instead?

With several EC2 instances running MongoDB, I have the problem of setting up the instances myself.

With SimpleDB, I have the problem of locking myself into Amazon's data structure, right?

What differences are there development-wise? Shouldn't I be able to just switch the DAO of my service layer to write to either MongoDB or AWS SimpleDB?

3 answers

SimpleDB has some scalability limitations. You can only scale by sharding, and that sharding is manual. It has higher latency than MongoDB or Cassandra, it has a throughput limit, and it is priced higher than the other options.

If you need richer query options, have a high read rate, and don't have that much data, MongoDB is better. For durability, though, you need at least 2 MongoDB server instances running as master/slave; otherwise you can lose the last minute of your data. Scalability is manual, although autosharding is implemented in version 1.6. It's much faster than SimpleDB.

Cassandra has weak query options but is as durable as PostgreSQL. It is as fast as Mongo, and faster at larger data sizes. Write operations are faster than read operations on Cassandra. It can scale automatically by firing up EC2 instances, but you have to modify config files a bit (if I remember correctly). If you have terabytes of data, Cassandra is your best bet. There is no need to shard your data; it was designed to be distributed from day one. You can keep any number of copies of all your data, and if some servers are dead it will automatically return results from the live ones and distribute the dead servers' data to the others. It's highly fault tolerant. You can add any number of instances, and it's much easier to scale than the other options. It has strong .NET and Java client options, which offer connection pooling, load balancing, marking of dead servers, ...

Another option for big data is Hadoop, but it's not as real-time as the others; you can use Hadoop for data warehousing. Neither Cassandra nor Mongo has transactions, so if you need transactions PostgreSQL is a better fit. Another option is Amazon RDS, but its performance is bad and its price is high. If you want to use databases or SimpleDB, you may also need data caching (e.g. memcached).

For web apps, I recommend Mongo if your data is small and Cassandra if it is large. You don't need a caching layer with Mongo or Cassandra; they are already fast. I don't recommend SimpleDB; it also locks you into Amazon, as you said.

If you are using C#, Java or Scala, you can write an interface for the data access layer and implement it for Mongo, MySQL, Cassandra or anything else. It's simpler in dynamic languages (e.g. Ruby, Python, PHP). You can write a provider for two of them if you want and switch the storage, perhaps even at runtime, with only a configuration change; it's all possible. Development with Mongo, Cassandra and SimpleDB is easier than with a relational database, and they are schema-free, though it also depends on the client library/connector you're using. The simplest one is Mongo. There's only one index per table in Cassandra, so you have to manage other indexes yourself, but as far as I know secondary indexes will be possible with the 0.7 release of Cassandra. You can also start with any of them and replace it in the future if you have to.
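
To make the interface idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `UserStore`, the two classes, `make_store` and the backend names are made up, and both implementations are in-memory stand-ins rather than real MongoDB or SimpleDB clients. One stand-in keeps native types and the other stores every value as a string, which hints at where the abstraction leaks.

```python
from abc import ABC, abstractmethod
from typing import Optional


class UserStore(ABC):
    """Hypothetical DAO interface; the service layer only talks to this."""

    @abstractmethod
    def save(self, user_id: str, attrs: dict) -> None:
        ...

    @abstractmethod
    def load(self, user_id: str) -> Optional[dict]:
        ...


class DocumentStore(UserStore):
    """In-memory stand-in for a document store: values keep their types."""

    def __init__(self) -> None:
        self._docs: dict = {}

    def save(self, user_id: str, attrs: dict) -> None:
        self._docs[user_id] = dict(attrs)

    def load(self, user_id: str) -> Optional[dict]:
        doc = self._docs.get(user_id)
        return dict(doc) if doc is not None else None


class StringAttributeStore(UserStore):
    """In-memory stand-in for a store that keeps every value as a string."""

    def __init__(self) -> None:
        self._items: dict = {}

    def save(self, user_id: str, attrs: dict) -> None:
        self._items[user_id] = {key: str(value) for key, value in attrs.items()}

    def load(self, user_id: str) -> Optional[dict]:
        item = self._items.get(user_id)
        return dict(item) if item is not None else None


# Backend names are made up for this sketch.
BACKENDS = {"document": DocumentStore, "string-attribute": StringAttributeStore}


def make_store(config: dict) -> UserStore:
    """Pick the implementation from configuration alone."""
    return BACKENDS[config["backend"]]()
```

Calling `make_store({"backend": "document"})` or `make_store({"backend": "string-attribute"})` gives the service layer the same `save`/`load` calls, but a number saved through the second one comes back as a string, so the calling code still has to know something about the backend.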

I think you have a question of both time and speed.

MongoDB / Cassandra are going to be much faster, but you will have to invest $$$ to get them going. This means you'll need to set up and run server instances for all of them and figure out how they work.

On the other hand, you don't have to pay a "per transaction" cost directly; you just pay for the hardware, which is probably more efficient for larger services.

In the Cassandra / MongoDB fight, here's what you'll find (based on testing I've personally been involved with over the last few days).

Cassandra:

Scaling / Redundancy is very core
Configuration can be very intense
To do reporting you need map-reduce, and for that you need to run a Hadoop layer. This was a pain to get configured and a bigger pain to get performant.

MongoDB:

Configuration is relatively easy (even for the new sharding, this week)
Redundancy is still "getting there"
Map-reduce is built-in and it's easy to get data out.

Honestly, given the configuration time required for our 10s of GBs of data, we went with MongoDB on our end. I can imagine using SimpleDB for "must get these running" cases. But configuring a node to run MongoDB is so ridiculously simple that it may be worth skipping the SimpleDB route.

In terms of DAO, there are tons of libraries already for Mongo. The Thrift framework for Cassandra is well supported. You can probably write some simple logic to abstract away connections. But it will be harder to abstract away things more complex than simple CRUD.

Now, 5 years later, it is not hard to set up Mongo on any OS. The documentation is easy to follow, so I do not see setting up Mongo as a problem. Other answers addressed the questions of scalability, so I will try to address the question from the point of view of a developer (what limitations each system has):

I will use S for SimpleDB and M for Mongo.

M is written in C++, S is written in Erlang (not the fastest language)
M is open source, installed everywhere, S is proprietary, can run only on Amazon AWS. You also have to pay for a whole bunch of stuff with S
S has a whole bunch of strange limitations. M's limitations are way more reasonable. The strangest limitations are:
- maximum size of domain (table) is 10 GB
- attribute value length (size of field) is 1024 bytes
- maximum items in Select response - 2500
- maximum response size for Select (the maximum amount of data S can return to you) - 1 MB
S supports only a few languages (Java, PHP, Python, Ruby, .NET), M supports way more
both support REST
S has a query syntax very similar to SQL (but way less powerful). With M you need to learn a new syntax which looks like JSON (although it is straightforward to learn the basics)
with M you have to learn how to architect your database. Because many people think that schemaless means you can throw any junk into the database and extract it with ease, they might be surprised that the "junk in, junk out" maxim still applies. I assume the same is true of S, but cannot claim it with certainty.
neither allows case-insensitive search. In M you can use a regex to somehow (ugly/no index) overcome this limitation without introducing an additional lowercase field/application logic.
in S sorting can be done only on one field
because of the 5 s time limit, count in S can behave strangely. If 5 seconds pass and the query has not finished, you end up with a partial count and a token which allows you to continue the query. Application logic is responsible for collecting all this data and summing it up.
everything is a UTF-8 string, which makes it a pain in the ass to work with non-string values (like numbers, dates) in S. M's type support is way richer.
neither has transactions or joins
M supports compression, which is really helpful for NoSQL stores, where the same field name is stored over and over again.
S supports just a single index, M has single, compound, multi-key, geospatial etc.
both support replication and sharding
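
Two of the points above (the partial count with a continuation token, and everything being a string) push work into the application, and that can be sketched roughly as follows. This is only an illustration under assumptions: `count_all`, `fake_run_count` and `encode_int` are hypothetical names, the fake function stands in for whatever client call actually runs the query, and the zero-padding width of 10 is arbitrary. Zero-padding is a common workaround so that string comparison orders non-negative integers correctly.

```python
def count_all(run_count, expression):
    """Sum partial counts until the store stops returning a continuation token.

    run_count is any callable taking (expression, token) and returning
    (partial_count, next_token); next_token is None once the count is complete.
    """
    total = 0
    token = None
    while True:
        partial, token = run_count(expression, token)
        total += partial
        if token is None:
            return total


def fake_run_count(expression, token):
    """Hypothetical stand-in for a count query that times out twice."""
    pages = {None: (1000, "t1"), "t1": (1000, "t2"), "t2": (345, None)}
    return pages[token]


def encode_int(value, width=10):
    """Zero-pad a non-negative integer so string order matches numeric order."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return str(value).zfill(width)
```

With the fake query, `count_all(fake_run_count, "select count(*) from mydomain")` adds up the three partial results. For the string issue, `"9" > "10"` holds for raw strings, while `encode_int(9) < encode_int(10)` gives the numeric order back.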

One of the most important things you should consider is that SimpleDB has a very rudimentary query language. Even basic things like group by, sum, average and distinct, as well as data manipulation, are not supported, so the functionality is not really much richer than Redis/Memcached. On the other hand, Mongo supports a rich query language.
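
In practice, missing group by / average support means doing the aggregation in application code after fetching the items. A rough sketch of what that looks like, assuming the items have already been fetched as dictionaries of strings (`group_average` and the sample field names are made up for illustration):

```python
from collections import defaultdict


def group_average(items, group_key, value_key):
    """Group already-fetched items by one attribute and average another.

    Values are converted with float() because this sketch assumes the
    store hands every attribute back as a string.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for item in items:
        group = item[group_key]
        sums[group] += float(item[value_key])
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Made-up sample data standing in for a query result.
items = [
    {"city": "Berlin", "price": "10"},
    {"city": "Berlin", "price": "20"},
    {"city": "Oslo", "price": "5"},
]
averages = group_average(items, "city", "price")
```

This works, but every item has to travel to the client first, which is exactly the cost a richer query language avoids.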