
Will Hadoop be faster than MySQL?

I am facing a big-data problem. I have a large MySQL (Percona) table that joins on itself once a day and produces about 25 billion rows. I am trying to group and aggregate all the rows to produce a result. The query is a simple join:

-- This query produces about 25 billion rows
SELECT t1.colA AS 'varchar(45)_1', t2.colB AS 'varchar(45)_2', COUNT(*)
FROM table t1
JOIN table t2 ON t1.date = t2.date
GROUP BY t1.colA, t2.colB

The problem is that this process takes more than a week to complete. I have started reading about Hadoop and wonder whether its MapReduce feature can reduce the time it takes to process the data. I noticed that Hive is a nice add-on that allows SQL-like queries against Hadoop. This all looks very promising, but I am facing an issue: I will only be running on a single machine:

6-core i7-4930K
16 GB RAM
128 GB SSD
2 TB HDD

When I run the query with MySQL, my resources are barely being used: only about 4 GB of RAM, and one core runs at 100% while the others sit close to 0%. I looked into this and found that MySQL executes each query on a single thread. This is also why Hadoop seems promising: it can run multiple mapper tasks to make better use of my resources. My question remains: can Hadoop replace MySQL in my situation and produce results within a few hours instead of over a week, even though it will only be running on a single node (although I know it is meant for distributed computing)?

Some very large hurdles for you are going to be that Hadoop is really meant to run on a cluster, not a single server. It can make use of multiple cores, but the amount of resources it consumes is very significant. I have a single system that I use for testing that runs Hadoop and HBase. It has a NameNode, secondary NameNode, DataNode, NodeManager, ResourceManager, ZooKeeper, etc. running. That is a very heavy load for a single system. On top of that, Hive is not a truly SQL-compliant replacement for an RDBMS, so it has to emulate some of the work by creating map/reduce jobs. These jobs are considerably more disk-intensive and use the HDFS file system to map the data into virtual tables (terminology may vary). HDFS also has fairly significant overhead, because the file system is meant to be spread over many systems.

With that said, I would not recommend solving your problem with Hadoop. I would recommend checking out what it has to offer in the future, though.

Have you looked into sharding the data, which can take advantage of multiple processors? IMHO this would be a much cleaner solution.

http://www.percona.com/blog/2014/05/01/parallel-query-mysql-shard-query/
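The shard-query idea is to run the same aggregate against every shard in parallel and then merge the partial counts in the application. A minimal sketch of that merge step, using in-memory SQLite databases as stand-ins for the MySQL shards (table and column names are illustrative):

```python
import sqlite3
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def make_shard(rows):
    """Build an in-memory 'shard' holding one slice of the data."""
    db = sqlite3.connect(":memory:", check_same_thread=False)
    db.execute("CREATE TABLE t (colA TEXT, colB TEXT)")
    db.executemany("INSERT INTO t VALUES (?, ?)", rows)
    return db

def partial_counts(db):
    """Run the aggregate on one shard and return its partial result."""
    cur = db.execute("SELECT colA, colB, COUNT(*) FROM t GROUP BY colA, colB")
    return {(a, b): n for a, b, n in cur}

# Two shards, each holding part of the data set.
shards = [make_shard([("x", "p"), ("x", "p")]),
          make_shard([("x", "p"), ("y", "q")])]

# Query all shards in parallel, then merge the partial GROUP BY results.
total = Counter()
with ThreadPoolExecutor() as pool:
    for part in pool.map(partial_counts, shards):
        total.update(part)
```

Because `COUNT(*)` is distributive, summing the per-shard counts gives the same result as one big GROUP BY over all the data.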

You might also look into testing Postgres. It has very good parallel query support built in.

Another idea is to look into an OLAP cube for the calculations; it can rebuild its indexes on the fly, so only the changes take effect. Since you are really doing data analytics, this may be an ideal solution.

Hadoop is not a magic bullet.

Whether anything is faster in Hadoop than in MySQL is mostly a question of how good your skills are at writing Java code (for mappers and reducers in Hadoop) versus SQL...

Usually, Hadoop shines when you have a problem that already runs well on a single host and you need to scale it out to 100 hosts at the same time. It is not the best choice if you have only a single computer, because it essentially communicates via disk. Writing to disk is not the best way to communicate. The reason this is popular in distributed systems is crash recovery. But you cannot benefit from that here: if you lose your single machine, you lose everything, even with Hadoop.

Instead:

  1. Figure out whether you are doing the right thing. There is nothing worse than spending time optimizing a computation you do not need. Consider working on a subset first, to figure out whether you are doing the right thing at all... (chances are, there is something fundamentally broken with your query in the first place!)

  2. Optimize your SQL. Use multiple queries to split the workload. Reuse earlier results instead of computing them again.

  3. Reduce your data. A query that is expected to return 25 billion rows must be expected to be slow! It is just really inefficient to produce results of this size. Choose a different analysis, and double-check that you are doing the right computation; most likely you are not, but are doing much too much work.

  4. Build optimal partitions. Partition your data by some key, put each date into a separate table, database, file, whatever... then process the join one such partition at a time (or, if you have good indexes on your database, just query one key at a time)!
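Points 2 and 4 combine nicely here, because the join condition is only `t1.date = t2.date`: within one date, every colA row pairs with every colB row, so the joined count for a (colA, colB) pair is the sum over dates of (count of colA per date) × (count of colB per date). That replaces the 25-billion-row join with two small per-date aggregates plus a merge in the application. A rough sketch with an in-memory SQLite stand-in (the schema is illustrative):

```python
import sqlite3
from collections import Counter, defaultdict

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (colA TEXT, colB TEXT, date TEXT)")
db.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("x", "p", "d1"), ("x", "q", "d1"), ("y", "p", "d1"),
])

# Step 1: two small per-partition aggregates, computed once and reused.
cnt_a = db.execute(
    "SELECT date, colA, COUNT(*) FROM t GROUP BY date, colA").fetchall()
cnt_b = db.execute(
    "SELECT date, colB, COUNT(*) FROM t GROUP BY date, colB").fetchall()

b_by_date = defaultdict(list)
for d, b, n in cnt_b:
    b_by_date[d].append((b, n))

# Step 2: within one date every colA row pairs with every colB row,
# so the joined count is the product of the two per-date counts.
total = Counter()
for d, a, na in cnt_a:
    for b, nb in b_by_date[d]:
        total[(a, b)] += na * nb
```

The two aggregates each return at most (dates × distinct values) rows, which is tiny compared to materializing the full cross product per date.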

Yes, you are right: MySQL is single threaded, i.e. one thread per query.
With only one machine, I don't think Hadoop will help you much, because you may utilize the cores but you will have contention over I/O, since all threads will try to access the disk.
The number of rows you mention is large, but you have not mentioned the actual size of your table on disk.
How big is your table actually? (In bytes on the HDD, I mean.)
Also, you have not mentioned whether the date column is indexed. It could help if you removed t2.colB or removed the GROUP BY altogether.
GROUP BY does sorting, and in your case that isn't good. You could try to do the group by in your application instead.
Perhaps you should tell us what exactly you are trying to achieve with your query. Maybe there is a better way to do it.
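Doing the GROUP BY in the application, as suggested above, amounts to streaming the ungrouped rows and counting pairs in a hash table, which avoids the server-side sort entirely. A rough sketch (the row list here is a hard-coded stand-in for a server-side cursor):

```python
from collections import Counter

def group_in_app(rows):
    """Count (colA, colB) pairs in the application instead of
    letting the database sort the result for GROUP BY."""
    counts = Counter()
    for col_a, col_b in rows:
        counts[(col_a, col_b)] += 1
    return counts

# Stand-in for iterating over an unsorted, ungrouped join result.
rows = [("x", "p"), ("x", "p"), ("y", "q")]
result = group_in_app(rows)
```

This trades the database's sort for hash-table memory proportional to the number of distinct (colA, colB) pairs, not the number of rows.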

I had a similarly large query and was able to take advantage of all cores by breaking my query up into multiple smaller ones and running them concurrently. Perhaps you could do the same. Instead of one large query that processes all dates, you could run two (or N) queries that each process a subset of dates and write their results into another table.

E.g., if your data spanned from 2012 to 2013:

INSERT INTO myResults (colA, colB, colC)
SELECT t1.colA, t2.colB, COUNT(*)
FROM table t1
JOIN table t2 ON t1.date = t2.date
WHERE t1.date BETWEEN '2012-01-01' AND '2012-12-31'
GROUP BY t1.colA, t2.colB

INSERT INTO myResults (colA, colB, colC)
SELECT t1.colA, t2.colB, COUNT(*)
FROM table t1
JOIN table t2 ON t1.date = t2.date
WHERE t1.date BETWEEN '2013-01-01' AND '2013-12-31'
GROUP BY t1.colA, t2.colB
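From the application side, driving the N date-range queries concurrently can look like the sketch below, with each worker opening its own connection, as each MySQL session would. A file-based SQLite database stands in for the server, and the schema is illustrative:

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

path = os.path.join(tempfile.mkdtemp(), "demo.db")
db = sqlite3.connect(path)
db.execute("CREATE TABLE t (colA TEXT, colB TEXT, date TEXT)")
db.execute("CREATE TABLE myResults (colA TEXT, colB TEXT, colC INTEGER)")
db.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("x", "p", "2012-06-01"),
    ("x", "p", "2012-06-01"),
    ("y", "q", "2013-03-15"),
])
db.commit()

def run_range(bounds):
    """One of the N smaller queries, covering a subset of dates."""
    lo, hi = bounds
    conn = sqlite3.connect(path, timeout=30)  # one connection per worker
    conn.execute(
        "INSERT INTO myResults "
        "SELECT t1.colA, t2.colB, COUNT(*) "
        "FROM t t1 JOIN t t2 ON t1.date = t2.date "
        "WHERE t1.date BETWEEN ? AND ? "
        "GROUP BY t1.colA, t2.colB", (lo, hi))
    conn.commit()
    conn.close()

ranges = [("2012-01-01", "2012-12-31"), ("2013-01-01", "2013-12-31")]
with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
    list(pool.map(run_range, ranges))
```

Note that SQLite serializes writers, so the parallelism here is only illustrative; against MySQL, each session would genuinely run on its own core, with disk I/O as the shared bottleneck.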
