
Hadoop smaller input file

I am using Hadoop in a slightly different way. In my case, the input size is really small. However, the computation time is long. I have a complicated algorithm that I will be running on every line of input. So even though the input size is less than 5 MB, the overall computation time is over 10 hours. That is why I am using Hadoop here. I am using NLineInputFormat to split the file by number of lines rather than by block size. In my initial testing, I had around 1,500 lines (splitting by 200 lines), and I saw only a 1.5x improvement on a four-node cluster compared to running it serially on one machine. I am using VMs. Could that be the issue, or is it that for smaller inputs there won't be much benefit from Hadoop? Any insights would be really helpful.
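For reference, a minimal driver sketch of the setup described above, assuming the new org.apache.hadoop.mapreduce API; the class names NLineDriver and HeavyLineMapper and the argument paths are hypothetical, not taken from the question:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineDriver {

    // Hypothetical mapper standing in for the expensive per-line algorithm.
    public static class HeavyLineMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // ... run the complicated algorithm on this input line ...
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cpu-heavy-per-line");
        job.setJarByClass(NLineDriver.class);
        job.setMapperClass(HeavyLineMapper.class);
        job.setNumReduceTasks(0); // map-only job

        // Hand 200 lines to each map task; with ~1,500 lines this yields ~8 splits.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 200);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that 1,500 lines at 200 lines per split gives only about 8 map tasks, so per-task startup overhead (especially on VMs) may be part of why the observed speedup is modest.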

To me, your workload resembles the SETI@Home workload: small payloads, but hours of crunching time.

Hadoop (or more specifically HDFS) is not designed for lots of small files. But I doubt that is an issue for MapReduce, the processing framework you are using.

If you want to keep your workloads together: 1) Split them into individual files (one workload, one file). If a file is smaller than the block size, it will go to one mapper. Typical block sizes are 64 MB or 128 MB.

2) Create a wrapper for FileInputFormat and override its isSplitable() method to return false. This will make sure the entire file contents are fed to one mapper, rather than Hadoop trying to split it line by line; see the sketch below.
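A minimal sketch of that wrapper, assuming the new org.apache.hadoop.mapreduce API and extending TextInputFormat (a concrete FileInputFormat subclass) so the usual line-oriented records are kept; the class name WholeFileTextInputFormat is a hypothetical choice:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Wrapper input format: by refusing to split, each input file is handed to
// exactly one mapper instead of being divided at block boundaries.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split a file
    }
}
```

The driver would then select it with job.setInputFormatClass(WholeFileTextInputFormat.class).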

Reference: http://hadoopilluminated.com/hadoop_book/HDFS_Intro.html

Hadoop is not really good at dealing with tons of small files; hence, it is often desirable to combine a large number of small input files into a smaller number of bigger files so as to reduce the number of mappers.

Input to a Hadoop MapReduce job is abstracted by InputFormat. FileInputFormat is the default implementation that deals with files in HDFS. With FileInputFormat, each file is split into one or more InputSplits, typically upper-bounded by the block size. This means the number of input splits is lower-bounded by the number of input files. This is not an ideal environment for a MapReduce job that deals with a large number of small files, because the overhead of coordinating the distributed processes far outweighs the work done on each small split.

The basic parameter that drives the split size is mapred.max.split.size.

Using CombineFileInputFormat and this parameter, we can control the number of mappers; a sketch of such a configuration is below.
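A small driver sketch of that idea (not the code from the linked answer), assuming Hadoop 2.x where the concrete CombineTextInputFormat subclass is available and where mapred.max.split.size has been renamed mapreduce.input.fileinputformat.split.maxsize; the class name and paths are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineSmallFilesDriver.class);

        // Pack many small files into fewer splits, capping each combined split
        // at 128 MB; the number of mappers becomes roughly total input / 128 MB
        // instead of one mapper per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ... set mapper/reducer classes and output types as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```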

Check out the implementation I posted for another answer here.
