
Hadoop and re-used mutable Writable fields

Here's a snippet from a word-count job implementation taken from an Apache Hadoop tutorial:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());   // the same Text instance is reused for every token
            context.write(word, one);
        }
    }
}

Is there any benefit to reusing the Text word field?

I've seen this done in many Hadoop programs. Is instantiation of this class so heavy that reusing it results in a performance improvement? If not, why do people do it, as opposed to something like context.write(new Text(itr.nextToken()), one);?
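Spelled out (my own sketch, not from the tutorial), the non-reusing version I'm asking about would look something like this:

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        // a fresh Text object is allocated for every token
        context.write(new Text(itr.nextToken()), one);
    }
}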

You're correct that instantiating a Text object is not heavy. However, if you're processing billions of records, you want to shave every possible nanosecond off each one. Every time you create a new Text object, Java has to allocate memory for it, keep track of it, and eventually garbage-collect it. That time really can add up in big jobs.
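To make that concrete, here's a rough micro-benchmark sketch. It's my own illustration rather than anything from the tutorial; it assumes hadoop-common is on the classpath, and the timings should be treated as indicative only, since the JIT can optimize away part of the allocation work.

import org.apache.hadoop.io.Text;

public class TextReuseDemo {
    public static void main(String[] args) {
        final int RECORDS = 10_000_000;
        final String token = "word";
        long checksum = 0;   // consumed below so the JIT can't discard the loops entirely

        // Variant 1: a fresh Text per record, as in context.write(new Text(...), one).
        long start = System.nanoTime();
        for (int i = 0; i < RECORDS; i++) {
            Text t = new Text(token);     // one allocation per record, garbage for the collector
            checksum += t.getLength();
        }
        long allocNanos = System.nanoTime() - start;

        // Variant 2: one mutable Text reused for every record, as in the tutorial mapper.
        Text reused = new Text();
        start = System.nanoTime();
        for (int i = 0; i < RECORDS; i++) {
            reused.set(token);            // only overwrites the internal byte buffer
            checksum += reused.getLength();
        }
        long reuseNanos = System.nanoTime() - start;

        System.out.printf("new Text per record: %d ms%n", allocNanos / 1_000_000);
        System.out.printf("reused Text:         %d ms%n", reuseNanos / 1_000_000);
        System.out.println("checksum = " + checksum);
    }
}

The per-record difference is tiny, but map() runs once per input record across the whole job, so the saved allocations and reduced GC pressure scale with the size of the data.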
