Hadoop Distributed Cache to process a large lookup text file

I am trying to implement a MapReduce job that processes a large text file (as a lookup file) in addition to the actual dataset (input). The lookup file is more than 2 GB. I tried to load the text file as a third argument, but I got a Java heap space error.

After doing some searching, it was suggested to use the distributed cache. This is what I have done so far. First, I used this method to read the lookup file:

public static String readDistributedFile(Context context) throws IOException {
    // The first cached file is the lookup file shipped with -files
    URI[] cacheFiles = context.getCacheFiles();
    Path path = new Path(cacheFiles[0].getPath());
    // Reuse the job configuration rather than creating a new one
    FileSystem fs = FileSystem.get(context.getConfiguration());
    StringBuilder sb = new StringBuilder();
    try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)))) {
        String line;
        while ((line = br.readLine()) != null) {
            // split line
            sb.append(line);
            sb.append("\n");
        }
    }
    return sb.toString();
}
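Note that this method buffers the entire file as one String. A leaner sketch would parse each line into a map while streaming, so the raw text is not kept on the heap twice. This assumes, only for illustration, that the lookup file holds tab-separated key/value lines (the real format is not shown here), and it additionally needs java.util.HashMap and java.util.Map:

public static Map<String, String> readLookupAsMap(Context context) throws IOException {
    URI[] cacheFiles = context.getCacheFiles();
    Path path = new Path(cacheFiles[0].getPath());
    FileSystem fs = FileSystem.get(context.getConfiguration());
    Map<String, String> lookup = new HashMap<>();
    try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)))) {
        String line;
        while ((line = br.readLine()) != null) {
            // Assumed format: key<TAB>value -- adjust to the real layout
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                lookup.put(parts[0], parts[1]);
            }
        }
    }
    return lookup;
}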

Second, in the Mapper:

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    // Runs once per map task, before any map() call
    String lookUpText = readDistributedFile(context);
    //do something with the text
}

Third, to run the job:

hadoop jar mapReduceJob.jar the.specific.class -files ../LargeLookUpFileInStoredLocally.txt /user/name/inputdataset/*.gz /user/name/output
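One thing to check: the -files generic option is only honored when the main class goes through GenericOptionsParser, which is what ToolRunner provides. A minimal driver sketch under that assumption (the class name and job name are placeholders, and the mapper/reducer setters are omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LookupJobDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // By the time run() is called, GenericOptionsParser has already
        // consumed -files, so args[0] and args[1] are the input and output.
        Job job = Job.getInstance(getConf(), "lookup job");
        job.setJarByClass(LookupJobDriver.class);
        // setMapperClass / setReducerClass / output types omitted here
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new LookupJobDriver(), args));
    }
}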

But the problem is that the job is taking a long time to load. Maybe using the distributed cache was not a good idea, or maybe I am missing something in my code.

I am working with Hadoop 2.5. I have already checked some related questions such as [1].

Any ideas will be great!

[1] Hadoop DistributedCache is deprecated - what is the preferred API?

The distributed cache is mostly used to ship files that MapReduce tasks need on the task nodes and that are not part of the job jar.

Another usage is when performing a join between a big and a small data set: rather than using multiple input paths, you use the single (big) file as the job input, fetch the other, small file from the distributed cache, and then compare (or join) the two data sets.
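A minimal sketch of that map-side join pattern (the tab-separated key/value layout and the class name are assumptions for illustration, not taken from the question): load the small file into a HashMap in setup(), then probe it for every record in map():

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Load the small cached file once per task (assumed key<TAB>value lines)
        URI[] cacheFiles = context.getCacheFiles();
        Path path = new Path(cacheFiles[0].getPath());
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assume the big input's records also lead with the join key
        String[] parts = value.toString().split("\t", 2);
        if (parts.length == 2) {
            String match = lookup.get(parts[0]);
            if (match != null) {
                context.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
            }
        }
    }
}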

The reason for the long runtime in your case is that you are trying to read the entire 2 GB file before the map phase even starts (it is read in the setup method).

Can you give the reason why you are loading the huge 2 GB file through the distributed cache?
