
NoSQL or MySQL for Data Analytics

We have a cluster (Hadoop, Pig) that churns out 350 GB of data (growing by a couple of GB a week).

All of this data needs to be made available for analytics.

We have a MySQL solution with a star schema (only part of the data is loaded into it). But the concern is how far one can stretch this.

Should I be looking at NoSQL like Hive for data analytics?

I read this article: http://anders.com/cms/282/Distributed.Data/Hadoop/Hbase/Hive

How big is Big Data, and when should I be looking away from MySQL? Will the structural rigidity of MySQL cause problems?

Currently the data is only a few GB (in MySQL), but it certainly will grow. How about MySQL clustering?

Should I be going down this path at all?

"350Gb (growing couple of GB a week)... All these data need to be made available for Analytics"

Do you have MySQL gurus in house? If yes, sure => just create and grow that MySQL cluster. The only problem with this solution is not that it is MySQL, nor that it is not NoSQL => it is simply that it requires an expert to set it up and to be on hand whenever it needs to be changed. But guess what => SQL is MUCH better and simpler for analytics than a map/reduce-ish SQL simulation.
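To make that last point concrete, here is a minimal sketch comparing one SQL aggregate over a tiny star schema with the same result spelled out by hand in map/reduce style. The table names (`fact_sales`, `dim_product`), the sample rows, and both function names are hypothetical, and SQLite stands in for MySQL only so that the example is self-contained.

```python
import sqlite3
from collections import defaultdict

# Hypothetical star schema: one fact table plus one dimension table.
# SQLite (in memory) stands in for MySQL to keep the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games'), (3, 'books');
    INSERT INTO fact_sales VALUES (1, 10.0), (2, 25.0), (3, 5.0), (1, 7.5);
""")


def revenue_by_category_sql(conn):
    """One declarative query: join the fact table to its dimension, then aggregate."""
    rows = conn.execute("""
        SELECT d.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS d ON d.product_id = f.product_id
        GROUP BY d.category
    """).fetchall()
    return dict(rows)


def revenue_by_category_mapreduce(conn):
    """The same result written out by hand as map, shuffle, and reduce steps."""
    categories = dict(conn.execute("SELECT product_id, category FROM dim_product"))

    # Map: emit one (category, amount) pair per fact row.
    pairs = [
        (categories[product_id], amount)
        for product_id, amount in conn.execute(
            "SELECT product_id, amount FROM fact_sales"
        )
    ]

    # Shuffle: group the emitted values by key.
    groups = defaultdict(list)
    for category, amount in pairs:
        groups[category].append(amount)

    # Reduce: collapse each group to a single total.
    return {category: sum(amounts) for category, amounts in groups.items()}
```

Both functions return the same totals per category. The SQL version states what is wanted in a single statement, while the hand-written version has to manage the join, the grouping, and the reduction itself.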

Something that can become a problem later with a MySQL solution is Oracle. So make sure you understand which features of MySQL you can use for free, and which features you would have to pay for.

If you do not have a MySQL expert in house, or you would rather not pay for one, you can definitely turn to NoSQL. That does not mean you would not need expertise in the NoSQL product, but configuring and running X nodes as a single system is an extremely simple and natural process for NoSQL solutions.

For example, in Riak and a couple of other NoSQL beasts, most of the distribution complexities are solved by the product without you needing to do anything at all => it really is that simple.

The price you pay with NoSQL is losing SQL (think of its nice aggregation features) and consistency, which becomes eventual. If you are strictly doing analytics, though, consistency may not be a price at all.

In return you get very natural Big Data handling, fault tolerance, and much more.

If you are in the Hadooooxyz space and you are okay with paying, take a look at Hadapt, which promises 5 times Hive's performance.

The question is of course now many months old, but... I recently came across InfiniDB, which puts a MySQL front end on a highly scalable, MapReduce-based Big Data engine aimed specifically at analytics. It may be a solution for this problem; in principle it should drop in and require very little administration and few code changes. Both scaling up on one box and scaling out across multiple servers are supported.

InfiniDB is not free.

Check out http://code.google.com/p/shard-query

This is like Map-Reduce over a sharded, shared-nothing set of databases. It works great for star schemas: shard the fact table over N nodes and duplicate the dimension tables on each server.
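The layout described above can be sketched roughly as follows. This is not Shard-Query's actual API, only an illustration of the scatter-gather idea under stated assumptions: in-memory SQLite databases stand in for the MySQL nodes, and the table names, the sample rows, the `sale_id % n` sharding rule, and the helper functions are all hypothetical.

```python
import sqlite3

# Hypothetical sample data: (product_id, category) and (sale_id, product_id, amount).
DIMENSION_ROWS = [(1, "books"), (2, "games")]
FACT_ROWS = [(1, 1, 10.0), (2, 2, 25.0), (3, 1, 5.0), (4, 2, 1.0), (5, 1, 7.5)]

# The query each shard runs locally against its own slice of the fact table.
PARTIAL_QUERY = """
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
"""


def build_shards(n):
    """Create n independent databases: dimensions duplicated, facts sharded."""
    shards = []
    for _ in range(n):
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)"
        )
        conn.execute(
            "CREATE TABLE fact_sales (sale_id INTEGER, product_id INTEGER, amount REAL)"
        )
        # Every shard gets a full copy of the (small) dimension table.
        conn.executemany("INSERT INTO dim_product VALUES (?, ?)", DIMENSION_ROWS)
        shards.append(conn)

    # Each fact row lives on exactly one shard, chosen here by sale_id modulo n.
    for row in FACT_ROWS:
        shards[row[0] % n].execute("INSERT INTO fact_sales VALUES (?, ?, ?)", row)
    return shards


def revenue_by_category(shards):
    """Scatter the same query to every shard, then merge the partial sums."""
    totals = {}
    for conn in shards:
        for category, partial_sum in conn.execute(PARTIAL_QUERY):
            totals[category] = totals.get(category, 0.0) + partial_sum
    return totals
```

Because each shard holds a full copy of the dimension table, every join stays local to one node, and only small partial aggregates travel back to be merged. Summing partial sums is correct for `SUM`; an average would need each shard to return both a sum and a count.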

You can check out this blog post for more info and performance testing results:

http://www.mysqlperformanceblog.com/2011/05/06/scale-out-mysql/

FYI: I'm the author of Shard-Query.

You switch when you start having the kinds of problems outlined in something like this comparative question: https://dba.stackexchange.com/questions/5/what-are-the-differences-between-nosql-and-a-traditional-rdbms

Other than that, it's a little difficult to answer the question beyond general advice, because you don't pose a specific problem that you are trying to solve (e.g. scaling, read speed, the problems with requiring 100% consistency, etc.).
