
Reading large files using MapReduce in Hadoop

I have code that reads files from an FTP server and writes them into HDFS. I have implemented a customised InputFormat reader that sets the isSplitable property of the input to false. However, this gives me the following error:

INFO mapred.MapTask: Record too large for in-memory buffer

The code I use to read the data is:

Path file = fileSplit.getPath();
FileSystem fs = file.getFileSystem(conf);
FSDataInputStream in = null;
try {
    in = fs.open(file);

    // contents is sized to the whole file, so the entire file is read into memory
    IOUtils.readFully(in, contents, 0, contents.length);

    value.set(contents, 0, contents.length);

} finally {
    IOUtils.closeStream(in);
}

Any ideas how to avoid a Java heap space error without splitting the input file? Or, if I do make isSplitable true, how should I go about reading the file?

If I understand you correctly, you load the whole file into memory. This is unrelated to Hadoop: you cannot do that in Java and be sure you have enough memory.
I would suggest defining some reasonable chunk size and making each chunk "a record", as in the sketch below.
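A minimal sketch of that idea, assuming the new-API org.apache.hadoop.mapreduce classes: the class name ChunkRecordReader and the 16 MB chunk size are illustrative, not from the original post.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits the (non-split) file as a sequence of fixed-size byte chunks,
// one chunk per record, instead of one record holding the whole file.
public class ChunkRecordReader extends RecordReader<LongWritable, BytesWritable> {

    private static final int CHUNK_SIZE = 16 * 1024 * 1024; // 16 MB per record; tune as needed

    private FSDataInputStream in;
    private long fileLength;
    private long bytesRead = 0;
    private final LongWritable key = new LongWritable();
    private final BytesWritable value = new BytesWritable();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        FileSplit fileSplit = (FileSplit) split;
        Path file = fileSplit.getPath();
        Configuration conf = context.getConfiguration();
        FileSystem fs = file.getFileSystem(conf);
        fileLength = fs.getFileStatus(file).getLen();
        in = fs.open(file);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (bytesRead >= fileLength) {
            return false; // whole file consumed
        }
        int toRead = (int) Math.min(CHUNK_SIZE, fileLength - bytesRead);
        byte[] buffer = new byte[toRead];
        IOUtils.readFully(in, buffer, 0, toRead); // read exactly one chunk
        key.set(bytesRead);                       // key = byte offset of this chunk
        value.set(buffer, 0, toRead);
        bytesRead += toRead;
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() {
        return fileLength == 0 ? 1.0f : (float) bytesRead / fileLength;
    }

    @Override
    public void close() throws IOException {
        IOUtils.closeStream(in); // always release the HDFS stream
    }
}

The matching FileInputFormat would keep isSplitable() returning false and hand out this reader from createRecordReader(); the mapper then sees the file as many BytesWritable chunks, so no single record has to hold the whole file.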

While a map function is running, Hadoop collects the output records in an in-memory buffer called MapOutputBuffer.

The total size of this in-memory buffer is set by the io.sort.mb property and defaults to 100 MB.

Try increasing this property's value in mapred-site.xml.
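For example, a property block like the following could go into mapred-site.xml; 200 is only an illustrative value in MB, and it should be tuned to the record size and the per-task heap, since this buffer is allocated by every map task.

<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>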
