
Hadoop with a small input file

I am using Hadoop in a slightly unusual way. In my case the input size is really small, but the computation time is long: I run a complicated algorithm on every line of input, so even though the input is less than 5 MB, the overall computation time is over 10 hours. That is why I am using Hadoop here, with NLineInputFormat to split the file by number of lines rather than by block size. In my initial testing I had around 1500 lines (splitting by 200 lines) and saw only a 1.5x improvement on a four-node cluster compared to running it serially on one machine. I am using VMs. Could that be the issue, or is there simply not much benefit from Hadoop for such a small input? Any insights would be really helpful.
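For context, here is a minimal driver sketch of the setup described above, assuming the Hadoop 2.x mapreduce API; the class names and the map body are placeholders standing in for the real per-line algorithm:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallInputDriver {

    // Placeholder mapper: this is where the expensive per-line algorithm would run.
    public static class LineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String result = line.toString(); // stand-in for the real long-running computation
            ctx.write(new Text(result), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-input-heavy-compute");
        job.setJarByClass(SmallInputDriver.class);

        // Split by line count instead of HDFS block size, so ~1500 lines
        // become several map tasks even though the file is under 5 MB.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 200); // roughly 8 splits for 1500 lines

        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0);                        // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        NLineInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```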

To me, your workload resembles a SETI@Home workload: small payloads but hours of crunching time.

Hadoop (or more specifically HDFS) is not designed for lots of small files. But I doubt that is an issue for MapReduce - the processing framework you are using.

If you want to keep each workload together:

1) Split the input into individual files (one workload per file). If a file is smaller than the block size, it will go to a single mapper. Typical block sizes are 64 MB or 128 MB.

2) Create a wrapper around FileInputFormat and override its isSplitable() method to return false. This makes sure the entire file content is fed to one mapper rather than Hadoop trying to split it line by line; a minimal sketch is shown below.
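A minimal sketch of option 2, assuming the Hadoop 2.x mapreduce API and extending TextInputFormat so the existing line record reader is reused; the class name is just illustrative:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Whole-file input format: by refusing to split, each input file is
// handed to exactly one mapper regardless of block or split-size settings.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
```

The driver would then register it with job.setInputFormatClass(NonSplittableTextInputFormat.class).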

Reference: http://hadoopilluminated.com/hadoop_book/HDFS_Intro.html

Hadoop is not really good at dealing with tons of small files; hence it is often desirable to combine a large number of small input files into a smaller number of bigger files, so as to reduce the number of mappers.

Input to a Hadoop MapReduce job is abstracted by InputFormat. FileInputFormat is the default implementation that deals with files in HDFS. With FileInputFormat, each file is split into one or more InputSplits, typically upper bounded by the block size. This means the number of input splits is lower bounded by the number of input files. That is not an ideal environment for MapReduce when it is dealing with a large number of small files, because the overhead of coordinating the distributed processes far outweighs the small amount of work each split carries.
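To make that bound concrete, here is a simplified restatement (not a verbatim copy) of how FileInputFormat sizes a split; with the asker's ~5 MB file and default settings it yields a single split, i.e. a single mapper:

```java
public class SplitSizeSketch {
    // Simplified restatement of FileInputFormat's computeSplitSize():
    // the HDFS block size caps the split unless an explicit max split size
    // is smaller, and the min split size can push it back up.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128 MB HDFS block
        long fileSize  = 5L * 1024 * 1024;     // a ~5 MB input file
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE); // defaults
        long numSplits = (fileSize + splitSize - 1) / splitSize;
        // With default settings the whole 5 MB file fits in one split,
        // i.e. one mapper, unless the split size is forced down.
        System.out.println("splits for one 5 MB file: " + numSplits); // prints 1
    }
}
```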

The basic parameter which drives the split size is mapred.max.split.size.

Using CombineFileInputFormat together with this parameter, we can control the number of mappers.
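As a sketch of that combination, assuming the Hadoop 2.x mapreduce API and CombineTextInputFormat (a concrete CombineFileInputFormat subclass shipped with Hadoop 2.x; on older releases you would subclass CombineFileInputFormat yourself). The 64 MB cap and the driver name are just placeholders; mapred.max.split.size is the pre-2.x name of the split.maxsize setting used below:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineSmallFilesDriver.class);

        // Pack many small files into fewer splits; each split (and therefore
        // each mapper) receives at most ~64 MB of input.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

        // job.setMapperClass(...) would register the real per-record mapper here.
        CombineTextInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```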

Check out the implementation I posted in another answer here.
