
NoSQL or MySQL for Data Analytics

Original post 2011-10-15 21:49:16 9 4 mysql/ nosql/ hive

We have a cluster (Hadoop, Pig) which churns through 350 GB of data (growing by a couple of GB a week).

All of this data needs to be made available for analytics.

We have a MySQL solution with a star schema (only part of the data is loaded into it). But the concern is: how far can one stretch this?

Should I be looking at NoSQL like Hive for data analytics?

I read this article: http://anders.com/cms/282/Distributed.Data/Hadoop/Hbase/Hive

How big is Big Data, and when should I be looking away from MySQL? Will the structural rigidity of MySQL cause problems?

Currently the data is only a few GB (in MySQL), but it certainly will grow. How about MySQL clustering?

Should I be going down this path at all?

4 Answers

350 GB (growing by a couple of GB a week)... All of this data needs to be made available for analytics

Do you have MySQL gurus in house? If yes, then just create and grow that MySQL cluster. The problem with this solution is not that it is MySQL, nor that it is not NoSQL; it is that it requires an expert to set it up and to stay by your side in case it needs to be changed. But guess what: SQL is much better and simpler for analytics than a map/reduce-ish SQL simulation.
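To make the "SQL vs. map/reduce-ish simulation" point concrete, here is a minimal sketch that computes the same per-group total both ways. Everything in it is an assumption for illustration: the `sales` table, the sample rows, and the function names are hypothetical, and an in-memory SQLite database stands in for MySQL.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, amount) rows standing in for a fact table.
rows = [("eu", 10.0), ("us", 5.0), ("eu", 2.5), ("us", 7.5), ("asia", 4.0)]


def totals_with_sql(rows):
    """Aggregate with one declarative SQL statement."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    )
    result = dict(cursor)  # each result row is a (region, total) pair
    conn.close()
    return result


def totals_with_map_reduce(rows):
    """The same aggregate written as explicit map, shuffle, and reduce steps."""
    # Map: emit one (key, value) pair per input row.
    mapped = [(region, amount) for region, amount in rows]
    # Shuffle: group the emitted values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: collapse each group into a single total.
    return {key: sum(values) for key, values in groups.items()}


sql_totals = totals_with_sql(rows)
mr_totals = totals_with_map_reduce(rows)
```

Both functions return the same totals; the difference is how much plumbing you write yourself for each new analytic question.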

Something that can become a problem later with a MySQL solution is Oracle. So make sure you understand which MySQL features you can use for free and which you would have to pay for.

If you do not have a MySQL expert in house, or you would rather not pay for one, you can definitely turn to NoSQL. That does not mean you would not need expertise in the NoSQL product, but configuring and running X nodes as a single system is an extremely simple and natural process for NoSQL solutions.

For example, in Riak and a couple of other NoSQL beasts, most of the distribution complexities are solved by the product without you needing to do anything at all; it really is that simple.

The price you pay with NoSQL is losing SQL (think of its nice aggregation features) and strict consistency, which becomes eventual. If you are strictly doing analytics, though, consistency may not be a price for you at all.

In return you get very natural Big Data handling, fault tolerance, and much more.

If you are in the Hadoop-xyz space and you are okay with paying, take a look at Hadapt, which promises 5 times Hive's performance.

The question is of course now many months old, but I recently came across InfiniDB, which puts a MySQL front end on a highly scalable, MapReduce-based Big Data engine aimed specifically at analytics. It may be a solution for this problem: in principle it should drop in and require very little administration and few code changes. Both scaling up on one box and scaling out across multiple servers are supported.

InfiniDB is not free.

Check out http://code.google.com/p/shard-query

This is like Map-Reduce over a sharded, shared-nothing set of databases. It works great for star schemas: shard the fact table over N nodes and duplicate the dimension tables on each server.
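The layout described above can be sketched roughly as follows. This is not Shard-Query's actual API or implementation; it is only an illustration of the idea, assuming hypothetical `product` and `sales` tables, round-robin placement in place of a real shard key, and in-memory SQLite databases standing in for separate MySQL servers. The shards are queried one after another here, whereas a real setup would query them in parallel.

```python
import sqlite3

# Hypothetical star schema: one dimension table (product), one fact table (sales).
PRODUCTS = [(1, "widget"), (2, "gadget")]
# Fact rows as (product_id, amount).
SALES = [(1, 10.0), (2, 4.0), (1, 6.0), (2, 1.0), (1, 3.0), (2, 5.0)]


def make_shards(n):
    """Create n in-memory databases, each with a full copy of the dimension table."""
    shards = []
    for _ in range(n):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("CREATE TABLE sales (product_id INTEGER, amount REAL)")
        conn.executemany("INSERT INTO product VALUES (?, ?)", PRODUCTS)
        shards.append(conn)
    return shards


def load_facts(shards, facts):
    """Spread fact rows across the shards round-robin (stand-in for a shard key)."""
    for i, row in enumerate(facts):
        shards[i % len(shards)].execute("INSERT INTO sales VALUES (?, ?)", row)


def total_by_product(shards):
    """Run the same join and partial SUM on every shard, then merge the partials."""
    query = """
        SELECT p.name, SUM(s.amount)
        FROM sales AS s
        JOIN product AS p ON p.id = s.product_id
        GROUP BY p.name
    """
    totals = {}
    for conn in shards:  # the "map" step: each shard aggregates its own facts
        for name, partial in conn.execute(query):
            # The "reduce" step: partial sums can simply be added together.
            totals[name] = totals.get(name, 0.0) + partial
    return totals


shards = make_shards(3)
load_facts(shards, SALES)
totals = total_by_product(shards)
```

Because every shard holds the full dimension table, each join stays local to one shard, and only the small per-shard partial sums need to be combined afterwards. Note that this simple merge works for `SUM` and `COUNT`; an aggregate such as `AVG` would have to be rebuilt from per-shard sums and counts.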

You can check out this blog post for more info and performance testing results:

http://www.mysqlperformanceblog.com/2011/05/06/scale-out-mysql/

FYI: I'm the author of Shard-Query.

You switch when you start having the kinds of problems outlined in something like this comparative question: https://dba.stackexchange.com/questions/5/what-are-the-differences-between-nosql-and-a-traditional-rdbms

Other than that, it's a little difficult to answer the question beyond general advice, because you don't pose a specific problem that you are trying to solve (e.g. scaling, read speed, the problems with requiring 100% consistency, etc.).