
Hadoop (Java): change the type of Mapper output values

I am writing a mapper whose output keys are user IDs (user_id) and whose output values are also of type Text. Here is how I do this:

public static class UserMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text userid = new Text();
    private Text catid = new Text();

    /* map method */
    public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString(), ","); /* separated by "," */
        int count = 0;

        userid.set(itr.nextToken());

        while (itr.hasMoreTokens()) {
            if (++count == 3) {
                catid.set(itr.nextToken());
                context.write(userid, catid);
            }else {
                itr.nextToken();
            }
        }
    }
}
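
To make the loop above easier to follow, here is a minimal sketch of the same token-skipping logic on a plain String, so it can be run without Hadoop. The class name FieldPicker and the sample line are hypothetical and not part of the original job. Unlike the mapper, the sketch returns null for short or empty lines instead of throwing.

```java
import java.util.StringTokenizer;

// Hypothetical helper, not part of the original job: it repeats the
// mapper's token-skipping logic on a plain String.
class FieldPicker {
    /** Returns {userId, categoryId}, or null if the line has fewer than four tokens. */
    static String[] pick(String line) {
        StringTokenizer itr = new StringTokenizer(line, ",");
        if (!itr.hasMoreTokens()) {
            return null;
        }
        String userId = itr.nextToken();        // first token
        int count = 0;
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken();
            if (++count == 3) {                 // third token after the user id
                return new String[] { userId, token };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed sample record; the real input format is not shown in the question.
        String[] pair = pick("u42,2014-01-01,9.99,books,extra");
        System.out.println(pair[0] + " -> " + pair[1]);  // prints "u42 -> books"
    }
}
```

Note that StringTokenizer does not return empty tokens, so a line such as u1,,a,b,c yields the tokens u1, a, b, c and the sketch picks c rather than b. If the input can contain empty fields, String.split(",", -1) may be a safer choice.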

And then, in the main program, I set the output class of the mapper as follows:

    Job job = new Job(conf, "Customer Analyzer");
    job.setJarByClass(popularCategories.class);
    job.setMapperClass(UserMapper.class);
    job.setCombinerClass(UserReducer.class);
    job.setReducerClass(UserReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);

Even though I have set the class of the output values to Text.class, I still get the following error when I compile it:

popularCategories.java:39: write(org.apache.hadoop.io.Text,org.apache.hadoop.io.IntWritable)
 in org.apache.hadoop.mapreduce.TaskInputOutputContext<java.lang.Object,
 org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,
 org.apache.hadoop.io.IntWritable> 
 cannot be applied to (org.apache.hadoop.io.Text,org.apache.hadoop.io.Text)
 context.write(userid, catid);
                           ^

According to this error, the compiler still expects the write method of the mapper's context to have this signature: write(org.apache.hadoop.io.Text,org.apache.hadoop.io.IntWritable)

So, when I change the class definition as follows, the problem is solved.

 public static class UserMapper extends Mapper<Object, Text, Text, Text> {

 }

So I want to understand the difference between the types in the class definition and setting the mapper output value class in the driver.

In your mapper class definition, you declare the output value class as IntWritable.

public static class UserMapper extends Mapper<Object, Text, Text, IntWritable>

However, inside the mapper class, you instantiate catid as Text.

private Text catid = new Text();

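Even though you have set the MapOutputValueClass to Text in the driver, the compiler checks context.write() against the type parameters in the class declaration, not against the job configuration. You therefore need to change the definition of your mapper class so that it is in sync with the key and value output classes set in the driver.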
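
One way to see the difference is a small sketch in plain Java. MiniContext and MiniJob are hypothetical stand-ins written for this illustration, not Hadoop classes. The point is that generic type arguments are checked by the compiler, while a Class object passed to a setter is ordinary run-time data that the compiler does not connect to those type arguments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT Hadoop classes.
class MiniContext<KEYOUT, VALUEOUT> {
    final List<String> written = new ArrayList<>();

    // The compiler only accepts arguments that match KEYOUT and VALUEOUT.
    void write(KEYOUT key, VALUEOUT value) {
        written.add(key + "=" + value);
    }
}

class MiniJob {
    private Class<?> mapOutputValueClass;

    // A plain run-time setting; it cannot change any generic declaration.
    void setMapOutputValueClass(Class<?> cls) {
        this.mapOutputValueClass = cls;
    }

    Class<?> getMapOutputValueClass() {
        return mapOutputValueClass;
    }
}

class CompileTimeDemo {
    public static void main(String[] args) {
        MiniJob job = new MiniJob();
        job.setMapOutputValueClass(String.class);   // analogous to the driver call

        MiniContext<String, Integer> ctx = new MiniContext<>();
        ctx.write("user1", 1);                      // compiles: matches <String, Integer>
        // ctx.write("user1", "cat7");              // would NOT compile, whatever the job says

        MiniContext<String, String> fixed = new MiniContext<>();
        fixed.write("user1", "cat7");               // compiles once the declaration changes
        System.out.println(ctx.written + " " + fixed.written);
    }
}
```

Uncommenting the marked line produces a "cannot be applied" style compiler error similar to the one in the question, regardless of what was passed to setMapOutputValueClass.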

From the Apache documentation page:

Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

java.lang.Object
org.apache.hadoop.mapreduce.Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

Where

KEYIN = Mapper input key (with the default TextInputFormat, the byte offset of the line)
VALUEIN = Mapper input value (with the default TextInputFormat, the contents of the line)
KEYOUT = Mapper output key (output of Mapper, input to Reducer)
VALUEOUT = Mapper output value (output of Mapper, input to Reducer)

Your problem was solved once you corrected the Mapper output value type in your definition from

public static class UserMapper extends Mapper<Object, Text, Text, IntWritable> {

to

public static class UserMapper extends Mapper<Object, Text, Text, Text> {

Have a look at this related SE question:

Why LongWritable (key) has not been used in Mapper class?

I have also found this article useful for understanding the concepts clearly.

The class definition declares both the input and output types. For instance, your Mapper takes in Object, Text and emits Text, Text. In your driver class you have set the expected output of the Mapper class to Text for both the key and the value, so the Hadoop framework expects your Mapper class definition to declare these output types, and expects your class to emit Text for both the key and the value when you call context.write(Text, Text).
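
The run-time side of this can be sketched as follows. Generic type arguments are erased at run time, so a framework cannot read VALUEOUT from a running mapper and has to rely on the classes configured in the driver. Hadoop reports a mismatch between the configured class and the emitted object as a run-time "Type mismatch" error. CheckedCollector below is a hypothetical stand-in that imitates such a check, not Hadoop's actual implementation.

```java
import java.io.IOException;

// Hypothetical stand-in for the collector behind context.write(); NOT Hadoop code.
class CheckedCollector {
    private final Class<?> keyClass;     // what the driver configured for keys
    private final Class<?> valueClass;   // what the driver configured for values
    private int records = 0;

    CheckedCollector(Class<?> keyClass, Class<?> valueClass) {
        this.keyClass = keyClass;
        this.valueClass = valueClass;
    }

    // Compares the run-time class of each emitted object with the configured class.
    void collect(Object key, Object value) throws IOException {
        if (key.getClass() != keyClass) {
            throw new IOException("Type mismatch in key: expected "
                    + keyClass.getName() + ", received " + key.getClass().getName());
        }
        if (value.getClass() != valueClass) {
            throw new IOException("Type mismatch in value: expected "
                    + valueClass.getName() + ", received " + value.getClass().getName());
        }
        records++;
    }

    int size() {
        return records;
    }

    public static void main(String[] args) {
        CheckedCollector collector = new CheckedCollector(String.class, String.class);
        try {
            collector.collect("user1", "cat7");   // accepted: both match the configuration
            collector.collect("user1", 1);        // rejected: Integer is not String
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

So the two settings guard different stages: the type parameters are enforced by the compiler, and the driver's output classes are enforced while the job runs. Both have to agree.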
