
Hadoop with a small input file

I am using Hadoop in a slightly unusual way. In my case the input size is really small, but the computation time is long: I run a complicated algorithm on every line of input, so even though the input is less than 5 MB, the overall computation time is over 10 hours. That is why I am using Hadoop here, with NLineInputFormat to split the file by number of lines rather than by block size. In my initial testing I had around 1500 lines (splitting by 200 lines) and saw only a 1.5x improvement on a four-node cluster compared to running it serially on one machine. I am using VMs. Could that be the issue, or is there simply not much benefit from Hadoop for such a small input? Any insights would be really helpful.
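For context, here is a minimal driver sketch of the setup described above, assuming the Hadoop 2.x mapreduce API; the class names and the map body are placeholders standing in for the real per-line algorithm:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallInputDriver {

    // Placeholder mapper: this is where the expensive per-line algorithm would run.
    public static class LineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String result = line.toString(); // stand-in for the real long-running computation
            ctx.write(new Text(result), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-input-heavy-compute");
        job.setJarByClass(SmallInputDriver.class);

        // Split by line count instead of HDFS block size, so ~1500 lines
        // become several map tasks even though the file is under 5 MB.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 200); // roughly 8 splits for 1500 lines

        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0);                        // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        NLineInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```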

To me, your workload resembles a SETI@Home workload: small payloads but hours of crunching time.

Hadoop (or more specifically HDFS) is not designed for lots of small files. But I doubt that is an issue for MapReduce - the processing framework you are using.

If you want to keep each workload together:

1) Split the input into individual files (one workload per file). If a file is smaller than the block size, it will go to a single mapper. Typical block sizes are 64 MB or 128 MB.

2) Create a wrapper around FileInputFormat and override its isSplitable() method to return false. This makes sure the entire file content is fed to one mapper rather than Hadoop trying to split it line by line; a minimal sketch is shown below.
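A minimal sketch of option 2, assuming the Hadoop 2.x mapreduce API and extending TextInputFormat so the existing line record reader is reused; the class name is just illustrative:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Whole-file input format: by refusing to split, each input file is
// handed to exactly one mapper regardless of block or split-size settings.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
```

The driver would then register it with job.setInputFormatClass(NonSplittableTextInputFormat.class).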

Reference: http://hadoopilluminated.com/hadoop_book/HDFS_Intro.html

Hadoop is not really good at dealing with tons of small files; hence it is often desirable to combine a large number of small input files into a smaller number of bigger files, so as to reduce the number of mappers.

Input to a Hadoop MapReduce job is abstracted by InputFormat. FileInputFormat is the default implementation that deals with files in HDFS. With FileInputFormat, each file is split into one or more InputSplits, typically upper bounded by the block size. This means the number of input splits is lower bounded by the number of input files. That is not an ideal environment for MapReduce when it is dealing with a large number of small files, because the overhead of coordinating the distributed processes far outweighs the small amount of work each split carries.
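To make that bound concrete, here is a simplified restatement (not a verbatim copy) of how FileInputFormat sizes a split; with the asker's ~5 MB file and default settings it yields a single split, i.e. a single mapper:

```java
public class SplitSizeSketch {
    // Simplified restatement of FileInputFormat's computeSplitSize():
    // the HDFS block size caps the split unless an explicit max split size
    // is smaller, and the min split size can push it back up.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128 MB HDFS block
        long fileSize  = 5L * 1024 * 1024;     // a ~5 MB input file
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE); // defaults
        long numSplits = (fileSize + splitSize - 1) / splitSize;
        // With default settings the whole 5 MB file fits in one split,
        // i.e. one mapper, unless the split size is forced down.
        System.out.println("splits for one 5 MB file: " + numSplits); // prints 1
    }
}
```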

The basic parameter which drives the split size is mapred.max.split.size.

Using CombineFileInputFormat together with this parameter, we can control the number of mappers.
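As a sketch of that combination, assuming the Hadoop 2.x mapreduce API and CombineTextInputFormat (a concrete CombineFileInputFormat subclass shipped with Hadoop 2.x; on older releases you would subclass CombineFileInputFormat yourself). The 64 MB cap and the driver name are just placeholders; mapred.max.split.size is the pre-2.x name of the split.maxsize setting used below:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineSmallFilesDriver.class);

        // Pack many small files into fewer splits; each split (and therefore
        // each mapper) receives at most ~64 MB of input.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

        // job.setMapperClass(...) would register the real per-record mapper here.
        CombineTextInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```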

Check out the implementation I posted in another answer here.
