
How to get the current filename in Hadoop Reduce

I am using the WordCount example, and in the Reduce function I need to get the name of the input file.

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    String filename = ((FileSplit)(.getContext()).getInputSplit()).getPath().getName();
    // ----------------------------^ I need to get the context and filename!
    key.set(key.toString() + " (" + filename + ")");
    output.collect(key, new IntWritable(sum));
  }
}

This is my modified code, where I want the filename to be printed alongside each word. I tried following Java Hadoop: How can I create mappers that take as input files and give an output which is the number of lines in each file?, but I couldn't get hold of the context object.

I am new to Hadoop and need help with this. Any ideas?

You can't get the context, because context is a construct of the "new API", and you are using the "old API".

Check out this word count example instead: http://wiki.apache.org/hadoop/WordCount

See the signature of the reduce function in this case:

public void reduce(Text key, Iterable<IntWritable> values, Context context) 

See! The context! Notice that this example imports from org.apache.hadoop.mapreduce instead of org.apache.hadoop.mapred.

This is a common issue for new Hadoop users, so don't feel bad. In general you want to stick to the new API, for a number of reasons. But be very careful of examples that you find, and realize that the new API and the old API are not interoperable (e.g., you can't have a new-API mapper and an old-API reducer).
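One wrinkle the question runs into: even in the new API, only the mapper's Context exposes getInputSplit(); a reducer's Context does not, because reduce input is merged from many splits. The usual workaround is to tag each record with its filename in the mapper. Here is a minimal, untested sketch of that idea using the new API (the class name FileTaggingMapper is just for illustration):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileTaggingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private String fileName;

  @Override
  protected void setup(Context context) {
    // The mapper's Context exposes the input split; grab the filename once per task.
    fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      // Tag each word with its source file so the reducer sees it in the key.
      context.write(new Text(tokenizer.nextToken() + " (" + fileName + ")"), ONE);
    }
  }
}

With this, a plain word-count reducer emits counts per (word, file) pair, which is the output the question asks for.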

Using the old MR API (the org.apache.hadoop.mapred package), add the below to the mapper class. (The map.input.file property is set per map task, so it is not available in a reducer.)

private String fileName;

public void configure(JobConf job)
{
    // "map.input.file" holds the path of the file the current map task is reading.
    fileName = job.get("map.input.file");
}
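To make that concrete, here is a minimal, untested sketch of a complete old-API mapper built around that configure() hook (the class name OldApiWordCountMapper is just for illustration):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OldApiWordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private String fileName;

  @Override
  public void configure(JobConf job) {
    // Set by the framework for each map task in the old API.
    fileName = job.get("map.input.file");
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      output.collect(new Text(tokenizer.nextToken() + " (" + fileName + ")"), ONE);
    }
  }
}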

Using the new MR API (the org.apache.hadoop.mapreduce package), add the below to the mapper class. (A reducer's Context has no getInputSplit(), since reduce input is merged from many splits.)

private String fileName;

@Override
protected void setup(Context context) throws java.io.IOException, java.lang.InterruptedException
{
    fileName = ((FileSplit) context.getInputSplit()).getPath().toString();
}
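Note that getPath().toString() returns the whole path (often a full URI), while getPath().getName() returns only the final component. A quick standalone illustration (the path here is made up):

import org.apache.hadoop.fs.Path;

public class PathNameDemo {
    public static void main(String[] args) {
        Path path = new Path("hdfs://namenode:8020/user/data/input.txt");
        System.out.println(path.toString()); // hdfs://namenode:8020/user/data/input.txt
        System.out.println(path.getName());  // input.txt
    }
}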

I used this approach and it works!

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Fetch the split once per call instead of once per token.
    FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
    String filename = fileSplit.getPath().getName();
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      // Tag each word with its source file so the filename survives into the reducer.
      word.set(tokenizer.nextToken() + " (" + filename + ")");
      output.collect(word, one);
    }
  }
}
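For completeness, here is a rough, untested sketch of how this Map class (together with the old-API Reduce from the question) could be wired into a job, assuming both are nested in a class named WordCount (that name and the job name are made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount-with-filename");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(WordCount.Map.class);
    conf.setReducerClass(WordCount.Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output directory
    JobClient.runJob(conf);
  }
}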

Let me know if I can improve it!
