Hadoop Distributed Cache to process a large lookup text file

I am trying to implement a MapReduce job that processes a large text file (as a lookup file) in addition to the actual dataset (input). The lookup file is more than 2 GB. I tried to load the text file as a third argument, but I got a Java heap space error.

After doing some searching, it was suggested to use the distributed cache. This is what I have done so far. First, I used this method to read the lookup file:

public static String readDistributedFile(Context context) throws IOException {
    // The first cached file is the lookup file shipped with -files
    URI[] cacheFiles = context.getCacheFiles();
    Path path = new Path(cacheFiles[0].getPath());
    // Reuse the job configuration rather than creating a new one
    FileSystem fs = FileSystem.get(context.getConfiguration());
    StringBuilder sb = new StringBuilder();
    try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)))) {
        String line;
        while ((line = br.readLine()) != null) {
            // split line
            sb.append(line);
            sb.append("\n");
        }
    }
    return sb.toString();
}
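Note that this method buffers the entire file as one String. A leaner sketch would parse each line into a map while streaming, so the raw text is not kept on the heap twice. This assumes, only for illustration, that the lookup file holds tab-separated key/value lines (the real format is not shown here), and it additionally needs java.util.HashMap and java.util.Map:

public static Map<String, String> readLookupAsMap(Context context) throws IOException {
    URI[] cacheFiles = context.getCacheFiles();
    Path path = new Path(cacheFiles[0].getPath());
    FileSystem fs = FileSystem.get(context.getConfiguration());
    Map<String, String> lookup = new HashMap<>();
    try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)))) {
        String line;
        while ((line = br.readLine()) != null) {
            // Assumed format: key<TAB>value -- adjust to the real layout
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                lookup.put(parts[0], parts[1]);
            }
        }
    }
    return lookup;
}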

Second, in the Mapper:

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    // Runs once per map task, before any map() call
    String lookUpText = readDistributedFile(context);
    //do something with the text
}

Third, to run the job:

hadoop jar mapReduceJob.jar the.specific.class -files ../LargeLookUpFileInStoredLocally.txt /user/name/inputdataset/*.gz /user/name/output
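One thing to check: the -files generic option is only honored when the main class goes through GenericOptionsParser, which is what ToolRunner provides. A minimal driver sketch under that assumption (the class name and job name are placeholders, and the mapper/reducer setters are omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LookupJobDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // By the time run() is called, GenericOptionsParser has already
        // consumed -files, so args[0] and args[1] are the input and output.
        Job job = Job.getInstance(getConf(), "lookup job");
        job.setJarByClass(LookupJobDriver.class);
        // setMapperClass / setReducerClass / output types omitted here
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new LookupJobDriver(), args));
    }
}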

But the problem is that the job is taking a long time to load. Maybe using the distributed cache was not a good idea, or maybe I am missing something in my code.

I am working with Hadoop 2.5. I have already checked some related questions such as [1].

Any ideas will be great!

[1] Hadoop DistributedCache is deprecated - what is the preferred API?

The distributed cache is mostly used to ship files that MapReduce tasks need on the task nodes and that are not part of the job jar.

Another usage is when performing a join between a big and a small data set: rather than using multiple input paths, you use the single (big) file as the job input, fetch the other, small file from the distributed cache, and then compare (or join) the two data sets.
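A minimal sketch of that map-side join pattern (the tab-separated key/value layout and the class name are assumptions for illustration, not taken from the question): load the small file into a HashMap in setup(), then probe it for every record in map():

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Load the small cached file once per task (assumed key<TAB>value lines)
        URI[] cacheFiles = context.getCacheFiles();
        Path path = new Path(cacheFiles[0].getPath());
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assume the big input's records also lead with the join key
        String[] parts = value.toString().split("\t", 2);
        if (parts.length == 2) {
            String match = lookup.get(parts[0]);
            if (match != null) {
                context.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
            }
        }
    }
}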

The reason for the long runtime in your case is that you are trying to read the entire 2 GB file before the map phase even starts (it is read in the setup method).

Can you give the reason why you are loading the huge 2 GB file through the distributed cache?
