MySQL Cluster vs. Hadoop for handling big data
I want to know the advantages and disadvantages of using a MySQL Cluster compared with using the Hadoop framework. Which is the better solution? I would like to read your opinion.
I think the advantages of using a MySQL Cluster are:
And I don't see a disadvantage! Are there any disadvantages that Hadoop does not have?
The advantages of Hadoop with Hive on top of it are:
and the disadvantage is:
So in my opinion a MySQL Cluster is the better solution for handling big data. Why is Hadoop considered the holy grail of handling big data? What is your opinion?
Both of the above answers miss a huge difference between MySQL and Hadoop. MySQL requires you to store data in a certain format. It likes heavily structured data - you declare the data type of each column in a table, and so on. Hadoop doesn't care about this at all.
Example - if you have a billion text log files, then to make analysis even possible in MySQL you'd first need to parse the data and load it into a MySQL table, typing each column along the way. With Hadoop and MapReduce, you define the function that scans, analyzes, and returns the data from its raw source - you don't need a pre-processing ETL step to get it pre-structured.
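To make the log example concrete, here is a minimal sketch of the map and reduce steps in plain Python. It is not Hadoop code and it runs in a single process; the log format, the sample lines, and the names `map_status`, `reduce_counts`, and `run_job` are all assumptions made up for illustration. The point is only that the parsing lives inside the map function, so no table schema has to be declared and loaded beforehand.

```python
from collections import Counter
from itertools import chain

# Hypothetical raw log lines; this format is an assumption, not from the post.
RAW_LOGS = [
    "2014-03-01 12:00:01 GET /index.html 200",
    "2014-03-01 12:00:02 GET /missing 404",
    "malformed line",
    "2014-03-01 12:00:03 POST /login 200",
    "2014-03-01 12:00:04 GET /missing 404",
]


def map_status(line):
    """Map step: emit (status_code, 1) for a parsable line, skip anything else."""
    fields = line.split()
    if len(fields) == 5 and fields[4].isdigit():
        yield fields[4], 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def run_job(lines):
    """Apply the map step to every raw line, then reduce all emitted pairs."""
    return reduce_counts(chain.from_iterable(map_status(line) for line in lines))


status_counts = run_job(RAW_LOGS)
print(status_counts)
```

Lines that do not match the expected shape are simply skipped by the map step, which is the kind of decision an ETL load into a typed table would force you to make up front.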
If the data is already structured and in MySQL - then (hopefully) it's well structured - why export it for Hadoop to analyze? If it isn't, why spend the time to ETL the data?
Hadoop is not a replacement for MySQL, so I think each has its own use cases.
Everyone knows Hadoop is better suited to batch jobs and offline computation, but there are also many related real-time products, such as HBase.
If you want to choose an offline compute and storage architecture, I suggest Hadoop rather than MySQL Cluster, because of:
So you can choose Hadoop for offline compute and storage and MySQL for online compute and storage. You can also learn more from the lambda architecture.
The other answer is good, but doesn't really explain why Hadoop is more scalable than MySQL Cluster for offline data crunching. Hadoop is more efficient for large data sets that must be distributed across many machines because it gives you full control over the sharding of data.
MySQL Cluster uses auto-sharding, which is designed to distribute the data randomly so that no single machine takes more than its share of the load. Hadoop, on the other hand, allows you to define the data partitioning explicitly, so that data points that need to be accessed together sit on the same machine, minimizing the communication among machines needed to get the job done. This makes Hadoop better for processing massive data sets in many cases.
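A rough way to picture the difference is to compare two partition functions. The sketch below is plain Python, not Hadoop or MySQL Cluster code; the node count, the record layout, and the names `spread_partition`, `colocated_partition`, and `nodes_per_user` are assumptions for illustration only. In Hadoop MapReduce the comparable hook is a custom `Partitioner`, whose `getPartition` method decides which partition each key goes to.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # assumed cluster size, for illustration only


def stable_hash(text):
    """Deterministic hash (Python's built-in hash() of str varies between runs)."""
    return zlib.crc32(text.encode("utf-8"))


def spread_partition(user_id, event_id):
    """Hash the whole record key: one user's events may scatter across nodes."""
    return stable_hash(f"{user_id}:{event_id}") % NUM_NODES


def colocated_partition(user_id, event_id):
    """Hash only the user id: all of one user's events land on the same node."""
    return stable_hash(user_id) % NUM_NODES


def nodes_per_user(records, partition):
    """Return, for each user, the set of nodes that hold that user's events."""
    placement = defaultdict(set)
    for user_id, event_id in records:
        placement[user_id].add(partition(user_id, event_id))
    return dict(placement)


# Hypothetical records: 3 users with 20 events each.
RECORDS = [(f"user{u}", f"event{e}") for u in range(3) for e in range(20)]

colocated = nodes_per_user(RECORDS, colocated_partition)
spread = nodes_per_user(RECORDS, spread_partition)

print("co-located:", {user: len(nodes) for user, nodes in colocated.items()})
print("spread:    ", {user: len(nodes) for user, nodes in spread.items()})
```

With the co-locating function, a job that needs all events of one user reads from a single node. With the spreading function, the load is balanced but the same job may have to pull data from several nodes.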
The answer to this question has a good explanation of this distinction.