
How to combine two mappers into one reducer

I am using Hadoop to compare two files. I'm using two mappers, so that each file goes to one mapper, and a single reducer. The first mapper gets a normal text file, and the second mapper gets a file in which each line has this format:

word 1 or -1
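For example, a few lines of that second file might look like this (made-up words, just to illustrate the format):

good 1
bad -1
nice 1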

Both mappers have this map signature:

public void map(LongWritable key, Text value, Context context) 

The first mapper's output will be:

key:word value:0

and the second mapper's output will be:

word 1 or -1

The reducer's input signature is:

public void reduce(Text key, Iterable<IntWritable> values, Context context) 

The reducer's output is:

context.write(key, new IntWritable(sum));

Right now I am getting the result from each map separately. I want the reducer to receive the same key from both maps, with the values from both, and combine them into one result. This is the code:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CompareTwoFiles extends Configured implements Tool {

    static ArabicStemmer Stemmer = new ArabicStemmer();
    String ArabicWord = "";

    // First mapper: reads the plain text file and emits (word, 0) for every token.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String token = "";

            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                token = tokenizer.nextToken();
                Stemmer.stemWord(token);
                word.set(token);
                context.write(word, new IntWritable(0));
            }
        }
    }

    // Second mapper: reads the "word 1 or -1" file and emits (word, 1) or (word, -1).
    public static class Map2 extends Mapper<LongWritable, Text, Text, IntWritable> {

        int n = 0;
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String token = "";

            if (line.contains("1") && !line.contains("-1")) {
                n = 1;
            } else if (line.contains("-1")) {
                n = -1;
            }
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                token = tokenizer.nextToken();
                if (!(token.equals("1")) && !(token.equals("-1"))) {
                    word.set(token);
                    context.write(word, new IntWritable(n));
                }
            }
        }
    }

    // Reducer: sums all values received for a word and writes the word if the sum is non-zero.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            if (sum != 0) {
                context.write(key, new IntWritable(sum));
            }
        }
    }
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new CompareTwoFiles(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:8020");
        conf.set("hadoop.job.ugi", "hdfs");
        Job job = new Job(conf);
        job.setJarByClass(CompareTwoFiles.class);
        job.setJobName("compare");
        job.setReducerClass(Reduce.class);
        job.setCombinerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Each input path gets its own mapper class; MultipleInputs assigns them.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, Map.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, Map2.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}

The result I'm getting is something like this:

first map
w1 0
w2 0
second map
w1 1
w2 3
w3 -1

The whole concept of MapReduce is that a Mapper emits values per key, in your case a value per word, and then all values for one key are grouped together (in your case, one reduce call should receive all counts for one word). That is, in the Mapper you write something like [key, value] for each word you have registered. There can only be one Mapper class and one Reducer class for one run.
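As a generic illustration of that pattern (a minimal word-count-style sketch with made-up class names, not your code; imports omitted as above), each map() call emits a [word, value] pair per token, and the framework then hands all values emitted for one word to a single reduce() call:

// Illustrative only: one Mapper class and one Reducer class per run.
public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            word.set(token);
            context.write(word, new IntWritable(1)); // one [word, value] pair per token
        }
    }
}

public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) { // all values emitted for this word arrive here together
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}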

In your case, it does not sound like MapReduce is a good fit for your problem. Comparing one file to another is not necessarily a problem that lends itself naturally to efficiency gains through partitioning and parallelization. What you might be able to do is to partition the text file and send a text partition and the entire word 1 or -1 file to each Mapper. The Reducers will then compute a sum/value for each word.
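A rough sketch of that idea (not a drop-in fix; the class name, the "scores.path" configuration property, and the in-memory lookup are assumptions for illustration, and imports are omitted as in your code): the mapper loads the whole word/score file once in setup(), each map() call scores only the words of its own text split, and your existing Reduce class sums per word as it already does.

// Hypothetical single-mapper variant: the entire "word 1 or -1" file is read in setup(),
// each text split is scored in map(), and the reducer sums the scores per word.
public static class ScoreMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final java.util.Map<String, Integer> scores = new java.util.HashMap<String, Integer>();
    private final Text word = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // "scores.path" is a made-up config property pointing at the word/score file on HDFS;
        // this assumes that file is small enough to keep in memory.
        Path scoresPath = new Path(context.getConfiguration().get("scores.path"));
        FileSystem fs = FileSystem.get(context.getConfiguration());
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(scoresPath)));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                scores.put(parts[0], Integer.parseInt(parts[1])); // word -> 1 or -1
            }
        }
        reader.close();
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            Integer score = scores.get(token);
            if (score != null) {
                word.set(token);
                context.write(word, new IntWritable(score)); // emit [word, 1 or -1]
            }
        }
    }
}

With that layout, MultipleInputs is no longer needed: the job uses this single mapper over the partitioned text input and keeps the existing Reduce class as the reducer.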

Feel free to also post your Mapper and Reducer classes here.
