
Hadoop and re-used mutable Writable fields

Here's a snippet from a word-count job implementation taken from an Apache Hadoop tutorial:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());   // the same Text instance is reused for every token
            context.write(word, one);
        }
    }
}

Is there any benefit to reusing the Text word field?

I've seen this done in many Hadoop programs. Is instantiation of this class so heavy that reusing it results in a performance improvement? If not, why do people do it, as opposed to something like context.write(new Text(itr.nextToken()), one);?
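Spelled out (my own sketch, not from the tutorial), the non-reusing version I'm asking about would look something like this:

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        // a fresh Text object is allocated for every token
        context.write(new Text(itr.nextToken()), one);
    }
}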

You're correct that instantiating a Text object is not heavy. However, if you're processing billions of records, you want to shave every possible nanosecond off each one. Every time you create a new Text object, Java has to allocate memory for it, keep track of it, and eventually garbage-collect it. That time really can add up in big jobs.
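To make that concrete, here's a rough micro-benchmark sketch. It's my own illustration rather than anything from the tutorial; it assumes hadoop-common is on the classpath, and the timings should be treated as indicative only, since the JIT can optimize away part of the allocation work.

import org.apache.hadoop.io.Text;

public class TextReuseDemo {
    public static void main(String[] args) {
        final int RECORDS = 10_000_000;
        final String token = "word";
        long checksum = 0;   // consumed below so the JIT can't discard the loops entirely

        // Variant 1: a fresh Text per record, as in context.write(new Text(...), one).
        long start = System.nanoTime();
        for (int i = 0; i < RECORDS; i++) {
            Text t = new Text(token);     // one allocation per record, garbage for the collector
            checksum += t.getLength();
        }
        long allocNanos = System.nanoTime() - start;

        // Variant 2: one mutable Text reused for every record, as in the tutorial mapper.
        Text reused = new Text();
        start = System.nanoTime();
        for (int i = 0; i < RECORDS; i++) {
            reused.set(token);            // only overwrites the internal byte buffer
            checksum += reused.getLength();
        }
        long reuseNanos = System.nanoTime() - start;

        System.out.printf("new Text per record: %d ms%n", allocNanos / 1_000_000);
        System.out.printf("reused Text:         %d ms%n", reuseNanos / 1_000_000);
        System.out.println("checksum = " + checksum);
    }
}

The per-record difference is tiny, but map() runs once per input record across the whole job, so the saved allocations and reduced GC pressure scale with the size of the data.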
