Why does submitting a job to MapReduce take so much time in general?
So, usually for a 20-node cluster, submitting a job to process 3 GB (200 splits) of data takes about 30 seconds and the actual execution about 1 minute. I want to understand what the bottleneck is in the job submission process, and to understand the following quote:
Per-MapReduce overhead is significant: starting/ending a MapReduce job costs time.
Some steps I'm aware of: 1. data splitting 2. jar file sharing
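To get a feel for where that time goes, one rough approach is to time the submission step separately from the execution step. The driver below is only a sketch (class name, paths, and the defaults used are assumptions, not from the original question), relying on the standard new-API Job class with the identity mapper and reducer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitTimingDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "submit-timing-demo");
        job.setJarByClass(SubmitTimingDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        long t0 = System.currentTimeMillis();
        // submit() does the client-side work: computes the input splits,
        // copies the job jar and configuration to HDFS, and hands the job
        // to the cluster (JobTracker or ResourceManager, depending on version).
        job.submit();
        long t1 = System.currentTimeMillis();

        // waitForCompletion() then just waits for the map/reduce tasks to finish.
        boolean ok = job.waitForCompletion(false);
        long t2 = System.currentTimeMillis();

        System.out.printf("submission: %d ms, execution: %d ms%n", t1 - t0, t2 - t1);
        System.exit(ok ? 0 : 1);
    }
}

The first number is roughly the client-side submission cost (split computation, staging the jar and configuration), the second the actual task execution.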
A few things to understand about HDFS and M/R that help explain this latency:
If you were to process a file with 1000 lines of content, then you are better off using a normal file read-and-process program. The Hadoop infrastructure for spawning a process on a distributed system will not yield any benefit; it will only contribute the additional overhead of locating the datanodes containing the relevant data chunks, starting the processing programs on them, and tracking and collecting the results.
Now expand that to 100s of petabytes of data, and these overheads look completely insignificant compared to the time it would take to process it. Parallelization of the processors (mappers and reducers) will show its advantage here.
So before analyzing the performance of your M/R job, you should first benchmark your cluster so that you understand the overheads better.
How much time does it take to run a no-operation map-reduce program on a cluster?
Use MRBench for this purpose:
To run this program, try the following (check the correct approach for the latest versions):
hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 50
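On newer Hadoop distributions the MRBench classes typically ship in the MapReduce client job-client tests jar instead of hadoop-test.jar; the path below is an assumption and depends on your version and install layout:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar mrbench -numRuns 50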
Surprisingly, on one of our dev clusters it was 22 seconds.
Another issue is file size.
If the file sizes are less than the HDFS block size, then Map/Reduce programs have significant overhead. Hadoop will typically try to spawn a mapper per block. That means if you have 30 5 KB files, Hadoop may end up spawning 30 mappers, one per block, even though each file is tiny. This is real wastage, because the per-task overhead is significant compared to the time spent processing each small file.
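A common way to avoid one mapper per tiny file is to pack many small files into fewer splits with CombineTextInputFormat (new API, Hadoop 2.x). The driver below is a minimal sketch; the class name, paths, and the 128 MB split cap are illustrative assumptions, not something from the original answer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-files-demo");
        job.setJarByClass(SmallFilesDriver.class);

        // Pack many small files into combined splits of up to ~128 MB each,
        // instead of one split (and one mapper) per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        CombineTextInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Mapper/Reducer omitted: the identity defaults are enough for the demo.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With this input format, the 30 5 KB files above would typically land in a single combined split served by one mapper instead of 30.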
As far as I know, there is no single bottleneck which causes the job run latency; if there were, it would have been solved a long time ago.
There are a number of steps which take time, and there are reasons why the process is slow. I will try to list them and estimate where I can:
I have seen a similar issue, and I can break the solution down into the following steps:
Check the data nodes and name nodes:
Try installing a lower version of Hadoop (Hadoop 2.5.2), which worked in two cases; it was found by trial and error.