
Why does submitting a job to MapReduce take so much time in general?

So, typically, for a 20-node cluster, submitting a job to process 3 GB (200 splits) of data takes about 30 seconds, and the actual execution takes about 1 minute. I want to understand what the bottleneck is in the job-submission process, and to understand the following quote:

Per-MapReduce overhead is significant: Starting/ending MapReduce job costs time

Some processes I'm aware of: 1. data splitting, 2. jar file sharing.

A few things to understand about HDFS and M/R that help explain this latency:

  1. HDFS stores your files as data chunks (blocks) distributed across multiple machines called datanodes.
  2. M/R runs multiple programs, called mappers, on each of the data chunks or blocks. The (key, value) output of these mappers is compiled together into a result by reducers. (Think of summing up the various results from multiple mappers.)
  3. Each mapper and reducer is a full-fledged program that is spawned on these distributed systems. It does take time to spawn full-fledged programs, even if, let us say, they do nothing (no-op map-reduce programs; a sketch of such a job follows this list).
  4. When the size of the data to be processed becomes very big, these spawn times become insignificant, and that is when Hadoop shines.
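
To make item 3 concrete, here is a minimal no-op job sketch using the identity Mapper and Reducer base classes of the Hadoop 2.x mapreduce API (the class name and paths are illustrative, not from the original answer); any time it takes is pure framework overhead:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NoOpJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "no-op");
            job.setJarByClass(NoOpJob.class);
            // The base Mapper and Reducer classes are identity functions,
            // so the job does no real work: everything it spends time on
            // is spawn/schedule/shuffle/collect overhead.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            // TextInputFormat (the default) feeds (LongWritable, Text) pairs.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Timing this job against a token input gives you a floor for per-job latency on your cluster.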

If you were to process a file with 1000 lines of content, you would be better off using a normal file read-and-process program. The Hadoop infrastructure for spawning a process on a distributed system will yield no benefit; it will only contribute the additional overhead of locating the datanodes containing the relevant data chunks, starting the processing programs on them, and tracking and collecting the results.
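
For comparison, here is a minimal sketch (the file name is hypothetical) of the kind of plain, single-machine program that beats M/R at this scale:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class LocalLineCount {
        public static void main(String[] args) throws IOException {
            long lines = 0;
            // Stream the file line by line; for a 1000-line file this
            // finishes in milliseconds, with none of the job-submission,
            // split-calculation, or task-spawning overhead of M/R.
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("input.txt"))) {
                while (reader.readLine() != null) {
                    lines++;
                }
            }
            System.out.println("lines = " + lines);
        }
    }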

Now expand that to 100 petabytes of data, and these overheads look completely insignificant compared to the time it would take to process the data. The parallelization of the processing (mappers and reducers) shows its advantage here.

So before analyzing the performance of your M/R job, you should first benchmark your cluster so that you understand the overheads better.

How much time does it take to run a no-operation map-reduce program on a cluster?

Use MRBench for this purpose:

  1. MRBench loops a small job a number of times.
  2. It checks whether small job runs are responsive and running efficiently on your cluster.
  3. Its impact on the HDFS layer is very limited.

To run this program, try the following (check the correct approach for the latest versions):

hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 50
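
On newer Hadoop 2.x installations, the benchmark typically ships in the job-client tests jar instead, so the invocation looks roughly like this (the exact jar path depends on your version and distribution):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar mrbench -numRuns 50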

Surprisingly, on one of our dev clusters it was 22 seconds.

Another issue is file size.

If the file sizes are smaller than the HDFS block size, then Map/Reduce programs carry significant overhead. Hadoop typically tries to spawn one mapper per block. That means that if you have 30 files of 5 KB each, Hadoop may end up spawning 30 mappers, one per file, even though each file is tiny. This is real wastage, because the per-program overhead is significant compared to the time each mapper spends processing its small file.
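
One common mitigation (a standard technique, not part of the original answer) is to pack many small files into fewer splits with CombineTextInputFormat, so that one mapper handles several files. A sketch against the Hadoop 2.x mapreduce API, reusing the identity classes from the earlier no-op example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombineSmallFiles {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-small-files");
            job.setJarByClass(CombineSmallFiles.class);
            job.setMapperClass(Mapper.class);      // identity, for illustration
            job.setReducerClass(Reducer.class);    // identity, for illustration
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            // Pack many small files into each split, up to 128 MB per split,
            // so 30 x 5 KB files are handled by 1 mapper instead of 30.
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
            CombineTextInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }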

As far as I know, there is no single bottleneck that causes the job-run latency; if there were, it would have been solved a long time ago.

There are a number of steps that take time, and there are reasons why the process is slow. I will try to list them and give estimates where I can:

  1. Running the hadoop client. It runs on Java, and I think about 1 second of overhead can be assumed.
  2. Putting the job into the queue and letting the current scheduler run it. I am not sure what the overhead is, but, because of the asynchronous nature of the process, some latency should exist.
  3. Calculating splits.
  4. Running and synchronizing tasks. Here we face the fact that TaskTrackers poll the JobTracker, and not the opposite. I think this is done for scalability's sake. It means that when the JobTracker wants to execute some task, it does not call the task tracker, but waits for the appropriate tracker to ping it to get the job. Task trackers cannot ping the JobTracker too frequently; otherwise, in large clusters, they would kill it.
  5. Running tasks. Without JVM reuse this takes about 3 seconds; with it, the overhead is about 1 second per task (see the configuration sketch after this list).
  6. The client polls the job tracker for the results (at least I think so), which also adds some latency to learning that the job has finished.
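
For item 5, the JVM-reuse knob on classic (JobTracker/TaskTracker era) MapReduce is the mapred.job.reuse.jvm.num.tasks property; note that YARN (MRv2) dropped it in favor of uber jobs. A minimal sketch of setting it in a job driver (the class name is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class JvmReuseDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // MR1 only: -1 lets one JVM run an unlimited number of tasks of
            // the same job, trimming the per-task spawn cost from roughly
            // 3 seconds to roughly 1 second, as estimated above.
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
            Job job = Job.getInstance(conf, "jvm-reuse-demo");
            // ... configure mapper/reducer/paths as in the earlier sketches ...
        }
    }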

I have seen a similar issue, and the solution can be broken down into the following steps:

  1. When HDFS stores too many small files with a fixed chunk size, there will be efficiency issues in HDFS; the best way is to remove all unnecessary files and small files holding data, then try again.
  2. Try restarting the data nodes and name node (the concrete commands follow this list):

    • Stop all the services using stop-all.sh.
    • Format the name node.
    • Reboot the machine.
    • Start all services using start-all.sh.
    • Check the data and name nodes.
  3. Try installing a lower version of Hadoop (hadoop 2.5.2); it worked in two cases, found by hit and trial.
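
The restart sequence in step 2 corresponds roughly to the following commands (script names vary slightly across versions, and note that formatting the name node erases all HDFS metadata, so only do this on a cluster whose data you can afford to lose):

stop-all.sh
hadoop namenode -format
start-all.sh
jps   # verify that the NameNode and DataNode processes are up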
