
Reading large files using MapReduce in Hadoop

I have code that reads files from an FTP server and writes them into HDFS. I have implemented a customised InputFormat reader that sets the isSplitable property of the input to false. However, this gives me the following error:

INFO mapred.MapTask: Record too large for in-memory buffer

The code I use to read the data is:

Path file = fileSplit.getPath();
FileSystem fs = file.getFileSystem(conf);
FSDataInputStream in = null;
try {
    in = fs.open(file);

    // contents is sized to the whole file, so the entire file is read into memory
    IOUtils.readFully(in, contents, 0, contents.length);

    value.set(contents, 0, contents.length);

} finally {
    IOUtils.closeStream(in);
}

Any ideas how to avoid a Java heap space error without splitting the input file? Or, if I do make isSplitable true, how should I go about reading the file?

If I understand you correctly, you load the whole file into memory. This is unrelated to Hadoop: you cannot do that in Java and be sure you have enough memory.
I would suggest defining some reasonable chunk size and making each chunk "a record", as in the sketch below.
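A minimal sketch of that idea, assuming the new-API org.apache.hadoop.mapreduce classes: the class name ChunkRecordReader and the 16 MB chunk size are illustrative, not from the original post.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits the (non-split) file as a sequence of fixed-size byte chunks,
// one chunk per record, instead of one record holding the whole file.
public class ChunkRecordReader extends RecordReader<LongWritable, BytesWritable> {

    private static final int CHUNK_SIZE = 16 * 1024 * 1024; // 16 MB per record; tune as needed

    private FSDataInputStream in;
    private long fileLength;
    private long bytesRead = 0;
    private final LongWritable key = new LongWritable();
    private final BytesWritable value = new BytesWritable();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        FileSplit fileSplit = (FileSplit) split;
        Path file = fileSplit.getPath();
        Configuration conf = context.getConfiguration();
        FileSystem fs = file.getFileSystem(conf);
        fileLength = fs.getFileStatus(file).getLen();
        in = fs.open(file);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (bytesRead >= fileLength) {
            return false; // whole file consumed
        }
        int toRead = (int) Math.min(CHUNK_SIZE, fileLength - bytesRead);
        byte[] buffer = new byte[toRead];
        IOUtils.readFully(in, buffer, 0, toRead); // read exactly one chunk
        key.set(bytesRead);                       // key = byte offset of this chunk
        value.set(buffer, 0, toRead);
        bytesRead += toRead;
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() {
        return fileLength == 0 ? 1.0f : (float) bytesRead / fileLength;
    }

    @Override
    public void close() throws IOException {
        IOUtils.closeStream(in); // always release the HDFS stream
    }
}

The matching FileInputFormat would keep isSplitable() returning false and hand out this reader from createRecordReader(); the mapper then sees the file as many BytesWritable chunks, so no single record has to hold the whole file.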

While a map function is running, Hadoop collects the output records in an in-memory buffer called MapOutputBuffer.

The total size of this in-memory buffer is set by the io.sort.mb property and defaults to 100 MB.

Try increasing this property's value in mapred-site.xml.
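For example, a property block like the following could go into mapred-site.xml; 200 is only an illustrative value in MB, and it should be tuned to the record size and the per-task heap, since this buffer is allocated by every map task.

<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>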
