
MultipleInputs: adding the same input to multiple mappers for comparison

I have two Mapper classes that take files from the same folder as input. Each file name contains a timestamp, and that timestamp determines which mapper the file is given to. Sometimes the same input file has to go to both mappers. I have tested that everything works when the two mappers receive different inputs, but when I give them the same input, one of the Mapper classes does not generate the result that the reducer needs for the comparison.

The code is enormous, so instead of posting it here I'll describe what I did. I scan the files in the directory and, based on the timestamps in their names, put them into two lists. Each list is then added as input to a different Mapper, because the two sets are computed differently, and the reducer compares the two results. When the time criteria for both mappers are almost the same, the same input file ends up in both lists, and then one of the mappers does not generate any result. Is that because one mapper cannot access the file while the other is using it? If so, is there any way around it?

Here MapPath1 is one list and MapPath2 is the other:

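// Register every path in the first list with Map1 (i is declared elsewhere).
for (i = 0; i < MapPath1.size(); i++) {
      MultipleInputs.addInputPath(job, new Path(MapPath1.get(i)), TextInputFormat.class, Map1.class);
}
// In comparative mode, also register every path in the second list with Map2.
if (type.equals("comparative")) {
      for (i = 0; i < MapPath2.size(); i++) {
            MultipleInputs.addInputPath(job, new Path(MapPath2.get(i)), TextInputFormat.class, Map2.class);
      }
}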
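Since the list-building code is not shown, here is a minimal plain-Java sketch of what that step might look like. Everything in it is an assumption made for illustration: the class name InputPartitioner, the file-name format (an 8-digit yyyyMMdd date somewhere in the name), and the use of one inclusive date window per mapper. It only shows how overlapping windows put the same file into both lists, which is the situation described above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits file names into the two mapper input lists.
class InputPartitioner {
    // Assumed name format: an 8-digit yyyyMMdd date in the name, e.g. "log_20140105.txt".
    private static final Pattern DATE = Pattern.compile("(\\d{8})");
    private static final DateTimeFormatter FMT = DateTimeFormatter.BASIC_ISO_DATE;

    final List<String> mapPath1 = new ArrayList<>();
    final List<String> mapPath2 = new ArrayList<>();

    // Adds each file to every list whose inclusive date window contains its timestamp.
    // An 8-digit run that is not a valid date makes LocalDate.parse throw.
    static InputPartitioner partition(List<String> files,
                                      LocalDate start1, LocalDate end1,
                                      LocalDate start2, LocalDate end2) {
        InputPartitioner p = new InputPartitioner();
        for (String f : files) {
            Matcher m = DATE.matcher(f);
            if (!m.find()) continue; // skip names without a timestamp
            LocalDate d = LocalDate.parse(m.group(1), FMT);
            if (within(d, start1, end1)) p.mapPath1.add(f);
            if (within(d, start2, end2)) p.mapPath2.add(f);
        }
        return p;
    }

    static boolean within(LocalDate d, LocalDate start, LocalDate end) {
        return !d.isBefore(start) && !d.isAfter(end);
    }

    // Files present in both lists: the case reported as failing in the question.
    List<String> overlap() {
        List<String> both = new ArrayList<>(mapPath1);
        both.retainAll(mapPath2);
        return both;
    }

    public static void main(String[] args) {
        InputPartitioner p = partition(
            Arrays.asList("log_20140101.txt", "log_20140105.txt", "log_20140109.txt"),
            LocalDate.of(2014, 1, 1), LocalDate.of(2014, 1, 5),
            LocalDate.of(2014, 1, 5), LocalDate.of(2014, 1, 9));
        System.out.println("Map1 inputs: " + p.mapPath1);
        System.out.println("Map2 inputs: " + p.mapPath2);
        System.out.println("In both:     " + p.overlap());
    }
}
```

Logging the overlap before submitting the job is a cheap way to confirm whether a failing run really is one where the same path was registered for both mappers.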

Update

I just found this question (Multiple mappers in hadoop), which is similar to mine, but I don't want to duplicate the input file because it can be large. Can anyone point me to how I can create two separate jobs that use different Mappers and feed their output to a single reducer?

"one of the Mapper class doesn't generate the result to be used for comparison in the reducer."

My guess is that both mappers are being launched on the same task tracker node and that the intermediate map output location is shared by both map tasks. You should check the task tracker nodes where these map tasks are launched to confirm this.

You should also run a map-only job by setting the number of reduce tasks to zero (for example with job.setNumReduceTasks(0)) and check the output, to confirm that the mappers are not sharing output directories.

As for a solution: it sounds like you are passing the same file to both mappers and sending the data from both mappers to a single reducer. That involves some duplication. Is it acceptable for your job output to contain this duplication?
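If that duplication is acceptable, one possible way to get it without copying the file or registering the same path twice is to have a single mapper run both computations on each record and tag every emitted value with its origin, so that the reducer can tell the two result streams apart. The sketch below simulates that idea in plain Java, without any Hadoop classes. The class name TaggedComparison, the "A:"/"B:" tags, the tab-separated record layout, and the two placeholder computations are all assumptions, since the real Map1 and Map2 logic is not shown in the question.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation (no Hadoop classes) of one mapper doing the work of two.
class TaggedComparison {
    // Placeholders for the two computations; the real Map1/Map2 logic is unknown.
    static long computeA(String value) { return value.length(); }
    static long computeB(String value) { return value.trim().length(); }

    // "Map" step: reads a "key<TAB>value" line once and emits one tagged record per computation.
    static List<Map.Entry<String, String>> map(String line) {
        String[] kv = line.split("\t", 2);
        List<Map.Entry<String, String>> out = new ArrayList<>();
        if (kv.length < 2) return out; // ignore lines without a tab separator
        out.add(new SimpleEntry<>(kv[0], "A:" + computeA(kv[1])));
        out.add(new SimpleEntry<>(kv[0], "B:" + computeB(kv[1])));
        return out;
    }

    // "Reduce" step: for one key, sums the values per tag and returns the A total minus the B total.
    static long reduce(List<String> taggedValues) {
        long a = 0, b = 0;
        for (String v : taggedValues) {
            long n = Long.parseLong(v.substring(2));
            if (v.startsWith("A:")) a += n; else b += n;
        }
        return a - b;
    }

    // Groups map output by key (standing in for the shuffle) and reduces each group.
    static Map<String, Long> run(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, String> e : map(line))
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet())
            result.put(e.getKey(), reduce(e.getValue()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("k1\tabc  ", "k2\txy", "k1\t z")));
    }
}
```

In a real job the same pattern would presumably mean registering the shared path once with a combined mapper and having the reducer branch on the tag prefix. Whether that fits depends on how different the two computations actually are.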
