
MySQL Cluster vs. Hadoop for handling big data

I want to know the advantages and disadvantages of using a MySQL Cluster versus using the Hadoop framework. Which is the better solution? I would like to read your opinion.

I think the advantages of using a MySQL Cluster are:

  1. high availability
  2. good scalability
  3. high performance / real-time data access
  4. you can use commodity hardware

And I don't see a disadvantage! Are there any disadvantages that Hadoop does not have?

The advantages of Hadoop with Hive on top of it are:

  1. also good scalability
  2. you can also use commodity hardware
  3. the ability to run in heterogeneous environments
  4. parallel computing with the MapReduce framework
  5. Hive with HiveQL

and the disadvantage is:

  1. no real-time data access; it may take minutes or hours to analyze the data

So in my opinion, a MySQL Cluster is the better solution for handling big data. Why is Hadoop the holy grail of handling big data? What is your opinion?

Both of the above answers miss a huge differentiation between MySQL and Hadoop. MySQL requires you to store data in a certain format. It likes heavily structured data: you declare the data type of each column in a table, and so on. Hadoop doesn't care about this at all.

Example: if you have a billion text log files, to make analysis even possible with MySQL you'd first need to parse the data and load it into a MySQL table, typing each column along the way. With Hadoop and MapReduce, you define the function that scans/analyzes/returns the data from its raw source; you don't need a pre-processing ETL step to get it pre-structured.
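As a minimal sketch of that schema-on-read idea (assuming plain Apache Hadoop MapReduce and a hypothetical space-delimited log format whose fourth field is an HTTP status code), a mapper can parse the raw lines directly, with no prior load into a typed table:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Schema-on-read: the raw log line is interpreted only at analysis time.
// The assumed layout (space-delimited, status code in field 4) is hypothetical.
public class StatusCodeMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text statusCode = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(" ");
        if (fields.length > 3) {                // skip malformed lines instead of failing
            statusCode.set(fields[3]);          // e.g. "200", "404", "500"
            context.write(statusCode, ONE);     // a summing reducer then counts per code
        }
    }
}
```

Paired with a standard summing reducer, this produces request counts per status code without ever declaring a column type up front.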

If the data is already structured and in MySQL, then (hopefully) it's well structured, so why export it for Hadoop to analyze? If it isn't, why spend the time to ETL the data?

Hadoop is not a replacement for MySQL, so I think they each have their own scenarios.

Everyone knows Hadoop is better for batch jobs or offline computation, but there are also many related real-time products, such as HBase.

If you want to choose an offline compute & storage architecture, I suggest Hadoop rather than MySQL Cluster, because of:

  1. Cost: a Hadoop cluster is obviously cheaper than a MySQL Cluster
  2. Scalability: Hadoop supports more than ten thousand machines in a cluster
  3. Ecosystem: MapReduce, Hive, Pig, Sqoop, etc.

So you can choose Hadoop for offline compute & storage and MySQL for online compute & storage; you can also learn more from the lambda architecture.
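As a rough sketch of that lambda-style split (the `events_recent` table, the connection details, and the `batchViewLookup` helper are all hypothetical), a serving query might merge a batch view precomputed on the Hadoop side with the most recent rows kept in MySQL:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class LambdaMergeSketch {

    // Hypothetical batch-layer lookup: in practice this would read a view
    // materialized by a nightly Hadoop/Hive job (e.g. stored in HBase or HDFS).
    static long batchViewLookup(String userId) {
        return 0L; // placeholder value
    }

    public static void main(String[] args) throws Exception {
        String userId = "user-123";
        long total = batchViewLookup(userId);          // offline part, recomputed in batch

        // Speed layer: only rows that arrived since the last batch run live in MySQL.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/analytics", "app", "secret");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT COUNT(*) FROM events_recent WHERE user_id = ?")) {
            ps.setString(1, userId);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    total += rs.getLong(1);            // merge batch view + recent events
                }
            }
        }
        System.out.println("Merged event count for " + userId + ": " + total);
    }
}
```

The point is simply that each store does what it is good at: Hadoop recomputes the bulk of the view offline, while MySQL answers the low-latency part.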

The other answer is good, but doesn't really explain why Hadoop is more scalable for offline data crunching than MySQL Cluster. Hadoop is more efficient for large data sets that must be distributed across many machines because it gives you full control over the sharding of data.

MySQL Cluster uses auto-sharding, which is designed to distribute the data more or less randomly so that no single machine gets hit with more of the load. Hadoop, on the other hand, allows you to explicitly define the data partitioning, so that multiple data points that require simultaneous access end up on the same machine, minimizing the amount of communication among machines needed to get the job done. This makes Hadoop better for processing massive data sets in many cases.
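As a minimal sketch of that explicit control (assuming plain Apache Hadoop MapReduce and a hypothetical composite key of the form `userId:timestamp`), a custom partitioner can route every record for the same user to the same reducer, so the data points that must be processed together land on the same machine:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes all records sharing the same user prefix to the same partition (reducer),
// so data points that must be analyzed together end up on the same machine.
// The "userId:timestamp" key layout is a hypothetical example, not a Hadoop convention.
public class UserPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String userId = key.toString().split(":", 2)[0];  // partition on the user part only
        return (userId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

The partitioner is then registered on the job with `job.setPartitionerClass(UserPartitioner.class)`.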

The answer to this question has a good explanation of this distinction.
