
Hadoop MapReduce: Possible to define two mappers and reducers in one Hadoop job class?

I have two separate Java classes for doing two different MapReduce jobs. I can run them independently. The input files they operate on are the same for both jobs. So my question is whether it is possible to define two mappers and two reducers in one Java class, like

mapper1.class
mapper2.class
reducer1.class
reducer2.class

and then like

job.setMapperClass(mapper1.class);
job.setMapperClass(mapper2.class);
job.setCombinerClass(reducer1.class);
job.setCombinerClass(reducer2.class);
job.setReducerClass(reducer1.class);
job.setReducerClass(reducer2.class);

Do these set methods actually override the previous ones, or add to them? I tried the code, but it executes only the last classes given, which makes me think they override. But there must be a way of doing this, right?
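For example, this quick check (with mapper1 and mapper2 being my two mapper classes from above) shows the override:

job.setMapperClass(mapper1.class);
job.setMapperClass(mapper2.class);
// Each setter stores a single entry in the job Configuration, so the
// second call replaces the first; this prints "mapper2".
System.out.println(job.getMapperClass().getSimpleName());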

The reason why I am asking is that then I could read the input files only once (one I/O) and process both MapReduce jobs. I would also like to know how I can write the output files into two different folders. At the moment, both jobs are separate and each requires its own input and output directory.

You can have multiple mappers, but in one job you can only have one reducer class. The features you need are MultipleInputs, MultipleOutputs and GenericWritable.

Using MultipleInputs, you can set the mapper and the corresponding InputFormat for each input path. Here is my post about how to use it.
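A minimal driver-side sketch (the paths, input formats and mapper names below are placeholders):

// org.apache.hadoop.mapreduce.lib.input.MultipleInputs:
// bind each input path to its own InputFormat and Mapper.
MultipleInputs.addInputPath(job, new Path("in/first"),
        TextInputFormat.class, Mapper1.class);
MultipleInputs.addInputPath(job, new Path("in/second"),
        SequenceFileInputFormat.class, Mapper2.class);
// Do not also call job.setMapperClass(): MultipleInputs installs a
// delegating mapper that dispatches to Mapper1/Mapper2 per path.

Both mappers feed the same shuffle, so they must emit the same map output key/value classes.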

Using GenericWritable, you can tell the different input value classes apart in the reducer. Here is my post about how to use it.
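For example, a minimal wrapper type (assuming one mapper emits Text values and the other IntWritable):

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Wraps either mapper's value type so both can travel through one shuffle.
public class MultiValueWritable extends GenericWritable {
    @SuppressWarnings("unchecked")
    private static final Class<? extends Writable>[] TYPES =
            new Class[] { Text.class, IntWritable.class };

    public MultiValueWritable() {}                        // required by Hadoop
    public MultiValueWritable(Writable value) { set(value); }

    @Override
    protected Class<? extends Writable>[] getTypes() { return TYPES; }
}

Each mapper then writes context.write(key, new MultiValueWritable(itsValue)), and the reducer calls get() on the wrapper and checks the concrete type with instanceof to see which mapper a record came from.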

Using MultipleOutputs, you can write different output classes from the same reducer, and send them to different files or folders.
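A sketch of both sides (the named outputs "first"/"second", the types and the folder names are placeholders). A baseOutputPath containing a '/' writes into a subfolder of the job's output directory, which also answers the two-folders part of the question:

// In the driver: declare one named output per logical result
// (org.apache.hadoop.mapreduce.lib.output.MultipleOutputs).
MultipleOutputs.addNamedOutput(job, "first",
        TextOutputFormat.class, Text.class, IntWritable.class);
MultipleOutputs.addNamedOutput(job, "second",
        TextOutputFormat.class, Text.class, IntWritable.class);

// In the reducer:
public class TwoWayReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        // "first/part" and "second/part" become subfolders of the output dir.
        mos.write("first", key, new IntWritable(sum), "first/part");
        mos.write("second", key, new IntWritable(sum), "second/part");
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close();   // required, otherwise the named outputs may be lost
    }
}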

You can use the MultipleInputs and MultipleOutputs classes for this, but the output of both mappers will go to the same reduce phase; there is no way to route each mapper's output to its own reducer. If the data flows for the two mapper/reducer pairs really are independent of one another, then keep them as two separate jobs. By the way, MultipleInputs will run your mappers without change, but the reducers would have to be modified to use MultipleOutputs.

As per my understanding, which comes from using MapReduce with Hadoop Streaming, you can chain multiple mappers and reducers where one consumes the output of another.

But you should not be able to run different mappers and reducers simultaneously. The number of mappers depends on the number of input blocks to be processed; mappers are instantiated based on that, not on the variety of mapper classes available for the job.

[Edit: Based on your comment]

I don't think that is possible. You can chain them (where the reducers will receive all the inputs from the mappers), and you can sequence them, but you cannot exclusively run independent sets of mappers and reducers.

I think what you can do is this: even though both of your reducers receive the inputs from both mappers, you can have the mappers output (K,V) pairs in such a way that the reducers can distinguish which mapper a (K,V) pair originated from. That way each reducer can process only its selected (K,V) pairs.
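A rough sketch of the idea, assuming Text values and a tab-separated tag:

// In mapper1: prefix each value with a tag naming the source.
context.write(key, new Text("M1\t" + value.toString()));

// In mapper2:
context.write(key, new Text("M2\t" + value.toString()));

// In a reducer that should only handle mapper1's flow:
for (Text v : values) {
    String s = v.toString();
    if (!s.startsWith("M1\t")) continue;   // skip the other flow's pairs
    String payload = s.substring(3);       // strip the "M1\t" tag
    // ... process payload ...
}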

The ChainMapper class allows you to use multiple Mapper classes within a single map task. For an example, please look here.
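A minimal sketch with the new-API ChainMapper (org.apache.hadoop.mapreduce.lib.chain; Mapper1/Mapper2 and the key/value types are placeholders for your own classes):

// Mapper1 runs first; its output records are piped directly into Mapper2
// within the same map task, so nothing is written to disk between them.
Configuration mapConf = new Configuration(false);
ChainMapper.addMapper(job, Mapper1.class,
        LongWritable.class, Text.class,    // Mapper1 input types
        Text.class, IntWritable.class,     // Mapper1 output types
        mapConf);
ChainMapper.addMapper(job, Mapper2.class,
        Text.class, IntWritable.class,     // must match Mapper1's output
        Text.class, IntWritable.class,
        mapConf);

Note this runs the mappers in sequence, one feeding the other; it does not run them side by side on the same input.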
