HBase MapReduce: how to use a custom class as the value for the mapper and/or reducer?

I am trying to familiarize myself with Hadoop/HBase MapReduce jobs so that I can write them properly. I have an HBase instance with a table called dns that holds some DNS records. I wrote a simple unique-domains counter that outputs a file, and it worked. So far I have only used IntWritable or Text, and I was wondering whether it is possible to use custom objects in my Mapper/Reducer. I tried to do it myself, but I'm getting:

Error: java.io.IOException: Initialization of all the collectors failed. Error in last collector was :null
    at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:415)
    at org.apache.hadoop.mapred.MapTask.access$100(MapTask.java:81)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:698)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:1011)
    at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:402)
    ... 9 more

Since I'm new to this, I don't know what to do. I'm guessing I have to implement one or more interfaces or extend an abstract class, but I can't find a proper example here or elsewhere on the internet.

I tried to make a simple domain counter from my dns table, this time using a class as a wrapper around an integer (for didactic purposes only). My Map class looks like this:

public class Map extends TableMapper<Text, MapperOutputValue> {
    private static byte[] columnName = "fqdn".getBytes();
    private static byte[] columnFamily = "d".getBytes();

    public void map(ImmutableBytesWritable row, Result value, Context context)
            throws InterruptedException, IOException {

        String fqdn = new String(value.getValue(columnFamily, columnName));
        Text key = new Text();
        key.set(fqdn);
        context.write(key, new MapperOutputValue(1));

    }
}

The Reducer:

public class Reduce extends Reducer<Text, MapperOutputValue, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<MapperOutputValue> values, Context context)
            throws IOException, InterruptedException {

        int i = 0;
        for (MapperOutputValue val : values) {
            i += val.getCount();
        }

        context.write(key, new IntWritable(i));
    }
}

And a part of my Driver/Main function:

TableMapReduceUtil.initTableMapperJob(
        "dns",
        scan,
        Map.class,
        Text.class,
        MapperOutputValue.class,
        job);

/* Set output parameters */
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setOutputFormatClass(TextOutputFormat.class);

As I said, MapperOutputValue is just a simple class that contains a private Integer, a one-parameter constructor, a getter and a setter. I also tried adding a toString method, but it still doesn't work.

So my question is: what's the best way to use custom classes as the output of the mapper and the input of the reducer? Also, let's say I want to use a class with multiple fields as the final output of the reducer. What should this class implement or extend? Is that a good idea, or should I stick to "primitives" such as IntWritable or Text?

Thanks!

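MapperOutputValue should implement Writable so that it can be serialized between tasks in the MapReduce job. The NullPointerException in MapOutputBuffer.init is most likely Hadoop failing to find a serializer for a value class that does not implement it. Replacing MapperOutputValue with a class like the one below should work: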

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class DomainCountWritable implements Writable {
    private Text domain;
    private IntWritable count;

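    // Hadoop creates instances through this no-arg constructor and then calls
    // readFields, so both fields must be non-null here.
    public DomainCountWritable() {
        this.domain = new Text();
        this.count = new IntWritable(0);
    }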

    public DomainCountWritable(Text domain, IntWritable count) {
        this.domain = domain;
        this.count = count;
    }

    public Text getDomain() {
        return this.domain;
    }

    public IntWritable getCount() {
        return this.count;
    }

    public void setDomain(Text domain) {
        this.domain = domain;
    }

    public void setCount(IntWritable count) {
        this.count = count;
    }

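    // Deserializes the fields in the same order that write emits them.
    @Override
    public void readFields(DataInput in) throws IOException {
        this.domain.readFields(in);
        this.count.readFields(in);
    }

    // Serializes the fields; the order must match readFields.
    @Override
    public void write(DataOutput out) throws IOException {
        this.domain.write(out);
        this.count.write(out);
    }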

    @Override
    public String toString() {
        return this.domain.toString() + "\t" + this.count.toString();
    }
}
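
To see what the write/readFields pair is doing, here is a minimal sketch of the same round trip that runs without Hadoop on the classpath. It is only an illustration under some assumptions: WritableLike is a hypothetical local stand-in for the two methods of org.apache.hadoop.io.Writable, and DomainCount and WritableRoundTripSketch are made-up names that use a plain String and int instead of Text and IntWritable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the two methods of org.apache.hadoop.io.Writable,
// declared locally so the sketch compiles without Hadoop.
interface WritableLike {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical two-field value: a domain name and a count.
class DomainCount implements WritableLike {
    private String domain = "";
    private int count;

    // No-arg constructor: an empty instance is created first,
    // then filled in by readFields.
    DomainCount() {}

    DomainCount(String domain, int count) {
        this.domain = domain;
        this.count = count;
    }

    String getDomain() { return domain; }
    int getCount() { return count; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(domain);
        out.writeInt(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields are read in the same order they were written.
        domain = in.readUTF();
        count = in.readInt();
    }

    @Override
    public String toString() {
        return domain + "\t" + count;
    }
}

class WritableRoundTripSketch {
    // Serializes a value to bytes and reads it back into a fresh instance,
    // roughly what happens to map output on its way to the reducer.
    static DomainCount roundTrip(DomainCount original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));

            DomainCount copy = new DomainCount();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        DomainCount copy = roundTrip(new DomainCount("example.com", 3));
        System.out.println(copy);
    }
}
```

With the DomainCountWritable above, the mapper and the initTableMapperJob call would reference DomainCountWritable.class instead of MapperOutputValue.class, and since getCount() returns an IntWritable, the reducer would sum val.getCount().get(). If such a class is also used as the reducer's final output value, TextOutputFormat writes it using its toString method.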
