
MultipleOutputs in Hadoop

I am using MultipleOutputs in the reduce phase of my job. The data set I am working on is around 270 MB, and I am running the job on a pseudo-distributed single node. I use a custom Writable for my map output values; the keys are the countries present in the dataset.

public class reduce_class extends Reducer<Text, name, NullWritable, Text> {
    public void reduce(Text key, Iterable<name> values, Context context)
            throws IOException, InterruptedException {
        MultipleOutputs<NullWritable, Text> m = new MultipleOutputs<NullWritable, Text>(context);
        long pat;
        String n;
        NullWritable out = NullWritable.get();
        TreeMap<Long, ArrayList<String>> map = new TreeMap<Long, ArrayList<String>>();
        for (name nn : values) {
            pat = nn.patent_No.get();
            if (map.containsKey(pat)) {
                map.get(pat).add(nn.getName().toString());
            } else {
                map.put(pat, new ArrayList<String>());
                map.get(pat).add(nn.getName().toString());
            }
        }
        for (Map.Entry<Long, ArrayList<String>> entry : map.entrySet()) {
            n = entry.getKey().toString();
            m.write(out, new Text("--------------------------"), key.toString());
            m.write(out, new Text(n), key.toString());
            ArrayList<String> names = entry.getValue();
            Iterator<String> i = names.iterator();
            while (i.hasNext()) {
                n = i.next();
                m.write(out, new Text(n), key.toString());
            }
            m.write(out, new Text("--------------------------"), key.toString());
        }
        m.close();
    }
}

Above is my reduce logic.

Problems

1) The above code works fine with a small data set, but fails with a Java heap space error on the 270 MB data set.

2) Using country as the key passes a very large number of values into a single Iterable. I tried to solve this differently, but MultipleOutputs creates a unique file per set of keys: I cannot append to a file created by a previous reduce call, as that throws an error, so for each distinct key I have to create a new file. Is there a way to work around this? Working around that error is what led me to use country names as keys (my final sorted data), but that is what throws the Java heap error.

Sample Input

3858241,"Durand","Philip","E.","","","Hudson","MA","US","",1
3858241,"Norris","Lonnie","H.","","","Milford","MA","US","",2
3858242,"Gooding","Elwyn","R.","","120 Darwin Rd.","Pinckney","MI","US","48169",1
3858243,"Pierron","Claude","Raymond","","","Epinal","","FR","",1
3858243,"Jenny","Jean","Paul","","","Decines","","FR","",2
3858243,"Zuccaro","Robert","","","","Epinal","","FR","",3
3858244,"Mann","Richard","L.","","PO Box 69","Woodstock","CT","US","06281",1

Sample output for small datasets

sample directory structure...

CA-r-00000

FR-r-00000

Quebec-r-00000

TX-r-00000

US-r-00000

* Individual contents *


3858241

Philip E. Durand

Lonnie H. Norris


3858242

Elwyn R. Gooding


3858244

Richard L. Mann


I know I am answering a very old question here, but let me throw out some ideas anyway. You are building a TreeMap in your reducer holding all the records that arrive in one reduce call. In MapReduce you cannot afford to hold all the records in memory, because that will never scale. You are building a map from patent_no to all the names associated with that patent_no. All you actually want is to separate the records by patent_no, so why not leverage the sorting the MapReduce framework already does for you?

You should include the patent_no and name, along with the country, in the writable key itself.

  • Write your Partitioner to partition only on country.
  • Sorting should be on country, patent_no, name.
  • Write your grouping comparator to group on country and patent_no.

As a result, all records with the same country go to the same reducer, sorted by patent_no and name, and within that reducer each distinct patent_no goes to a separate reduce call. Now all you need to do is write the values out through MultipleOutputs, and you get rid of the in-memory TreeMap entirely.
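To make the three pieces concrete, here is a plain-Java sketch of the comparison logic involved. The class and field names (CompositeKey, country, patentNo, name) are illustrative stand-ins: in a real job the key would implement WritableComparable, the partition logic would live in a Partitioner subclass set via job.setPartitionerClass, and the grouping logic in a WritableComparator set via job.setGroupingComparatorClass.

```java
import java.util.*;

class SecondarySortSketch {

    // Illustrative composite key: country + patentNo + name.
    static final class CompositeKey implements Comparable<CompositeKey> {
        final String country;
        final long patentNo;
        final String name;

        CompositeKey(String country, long patentNo, String name) {
            this.country = country;
            this.patentNo = patentNo;
            this.name = name;
        }

        // Sort comparator: full order on country, then patentNo, then name.
        @Override
        public int compareTo(CompositeKey o) {
            int c = country.compareTo(o.country);
            if (c != 0) return c;
            c = Long.compare(patentNo, o.patentNo);
            if (c != 0) return c;
            return name.compareTo(o.name);
        }
    }

    // Partitioner logic: route on country only, so one reducer
    // receives every record for a given country.
    static int partition(CompositeKey k, int numPartitions) {
        return (k.country.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Grouping comparator logic: keys equal on (country, patentNo)
    // land in the same reduce() call.
    static int groupCompare(CompositeKey a, CompositeKey b) {
        int c = a.country.compareTo(b.country);
        return c != 0 ? c : Long.compare(a.patentNo, b.patentNo);
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = new ArrayList<>(Arrays.asList(
            new CompositeKey("US", 3858244, "Richard L. Mann"),
            new CompositeKey("US", 3858241, "Philip E. Durand"),
            new CompositeKey("US", 3858241, "Lonnie H. Norris"),
            new CompositeKey("FR", 3858243, "Claude Pierron")));
        // Simulates the framework's sort phase.
        Collections.sort(keys);
        for (CompositeKey k : keys)
            System.out.println(k.country + " " + k.patentNo + " " + k.name);
    }
}
```

With these three pieces in place, the values for one patent arrive at the reducer already grouped and sorted, so nothing has to be buffered.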

And here are some points I would suggest you take care of:

  • Do not create a new MultipleOutputs in the reduce() method every time; instead, override setup() and create a single instance there (and close it in cleanup()).
  • Do not create a new Text() every time; instead, create one in setup() and reuse the same instance via Text's set(String) method. You could argue there is no point, since Java's GC will collect the garbage anyway, but you should always keep memory usage as low as possible so that garbage collection runs less frequently.
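The allocate-once-in-setup, mutate-per-call pattern can be sketched in plain Java like this. The class and method names are illustrative stand-ins so the sketch is self-contained; in a real job you would override setup(), reduce() and cleanup() on org.apache.hadoop.mapreduce.Reducer, and the reused buffer would be a single Text instance mutated with text.set(string).

```java
// Sketch of the reducer lifecycle: the framework calls setup() once per
// task, then reduce() once per key group. A StringBuilder stands in for
// the reused Text instance (Text.set() similarly overwrites in place).
class ReuseSketch {
    private StringBuilder reusable;  // stand-in for the single Text instance
    int allocations = 0;             // counts buffer allocations, for illustration

    void setup() {                   // called once per task, not per record
        reusable = new StringBuilder();
        allocations++;
    }

    String reduce(String value) {    // called once per key group
        reusable.setLength(0);       // reset the existing buffer...
        reusable.append(value);      // ...instead of allocating a new one
        return reusable.toString();
    }
}
```

However many times reduce() runs, only one buffer is ever allocated, which is exactly the effect of creating one Text in setup() and calling set() on it in every reduce call.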
