

Time spent by a Hadoop MapReduce mapper task to read input files from HDFS or S3

I am running a Hadoop MapReduce job that gets its input files from HDFS or Amazon S3. I am wondering whether it's possible to know how long a mapper task takes to read its file from HDFS or S3 into the mapper. I'd like to know the time spent just reading the data, not including the mapper's processing time for that data. The result I am looking for is something like MB/second for a given mapper task, which would indicate how fast the mapper can read from HDFS or S3. In other words, it's a measure of I/O performance.

Thanks.

Maybe you can just use an identity mapper and set the number of reducers to zero. Then the only thing done in your run is I/O; there will be no sorting or shuffling. Or, if you specifically want to focus on reading, you can replace the identity mapper with a mapper that doesn't write any output. Next I would set mapred.job.reuse.jvm.num.tasks=-1 to remove the per-task JVM startup overhead. It isn't perfect, but it is probably the easiest way to get a quick idea. If you want to do it precisely, I would consider implementing your own Hadoop counters, but currently I have no experience with that.
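Below is a minimal sketch of such a read-only, map-only job, assuming the org.apache.hadoop.mapreduce API and the default TextInputFormat. The mapper consumes every record but emits nothing, reducers are disabled so there is no sort or shuffle, and a custom counter tallies the bytes read so you can divide by each task's elapsed time afterwards. The class and counter names are illustrative, not from the original answer.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ReadThroughputJob {

    // Custom counter (illustrative name) to record how many bytes the mappers consumed.
    enum ReadCounters { BYTES_READ }

    // Mapper that reads its whole input split but writes no output,
    // so the task's wall-clock time is dominated by reading from HDFS/S3.
    public static class ReadOnlyMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Count the bytes of each record value; do no other work and emit nothing.
            context.getCounter(ReadCounters.BYTES_READ).increment(value.getLength());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hadoop 1.x property: reuse one JVM for all tasks to cut startup overhead.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

        Job job = Job.getInstance(conf, "read-throughput-probe");
        job.setJarByClass(ReadThroughputJob.class);
        job.setMapperClass(ReadOnlyMapper.class);
        job.setNumReduceTasks(0);                         // map-only: no sort/shuffle
        job.setOutputFormatClass(NullOutputFormat.class); // discard any output

        FileInputFormat.addInputPath(job, new Path(args[0])); // HDFS or S3 URI

        boolean ok = job.waitForCompletion(true);
        long bytes = job.getCounters()
                        .findCounter(ReadCounters.BYTES_READ).getValue();
        System.out.println("Total bytes read by mappers: " + bytes);
        System.exit(ok ? 0 : 1);
    }
}
```

Dividing a map task's byte count (the custom counter above, or the built-in HDFS_BYTES_READ counter in the FileSystemCounters group) by that task's elapsed time shown in the JobTracker or job history UI gives an approximate MB/second per mapper.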

