
Hadoop mapper/reducer re-use

How do mapper/reducer instances get re-used within a JVM that's kept alive perpetually?

For example, let's say I wanted to do something like this:

public class MyMapper extends MapReduceBase implements Mapper<K1, V1, K2, V2> {

    private Set<String> set = new HashSet<String>();

    public void map(K1 k1, V1 v1, OutputCollector<K2, V2> output, Reporter reporter)
            throws IOException {
        // ... do stuff ...

        set.add(k1.toString()); // add something to the set so that it can be used later

        // ... do other stuff ...

        if (set.contains("someString"))
            emitSomeKindOfOutput(output);
        else
            emitSomeOtherKindOfOutput(output);
    }

}

If the same mapper instance can be used for multiple tasks/jobs, then the member set could cause problems, because it would still contain junk left over from previous tasks/jobs. Is this kind of re-use possible in Hadoop? What about reducers?

You are definitely safe. Mapper and reducer instances are not reused. If you need to perform some initialization or cleanup, you can override the two methods configure() and close() provided by MapReduceBase. This is not required by your code sample.
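The configure()/close() lifecycle mentioned above can be sketched as follows. This is a non-compilable sketch assuming the old org.apache.hadoop.mapred API, with K1/V1/K2/V2 standing in for the real key/value types from the question:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyMapper extends MapReduceBase implements Mapper<K1, V1, K2, V2> {

    private Set<String> set;

    @Override
    public void configure(JobConf job) {
        // Called once per task, before any map() call: initialize per-task state here.
        set = new HashSet<String>();
    }

    public void map(K1 k1, V1 v1, OutputCollector<K2, V2> output, Reporter reporter)
            throws IOException {
        set.add(k1.toString());
        // ... emit output based on the set, as in the question ...
    }

    @Override
    public void close() throws IOException {
        // Called once after the last map() call: clear or release per-task state here.
        set.clear();
    }
}
```

Because a fresh mapper instance backs each task, configure() gives the set a clean starting point even if the hosting JVM is reused across tasks.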

If set were a static variable, then you would have to clear it in the close() method to be safe, even though most site configurations don't require it (by default a new JVM is forked for each map task; you have to set mapred.job.reuse.jvm.num.tasks to enable JVM reuse). Two map tasks are never run concurrently in the same JVM.
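For reference, JVM reuse in the Hadoop 1.x (mapred) configuration is controlled by the property mentioned above; a sketch of the corresponding mapred-site.xml entry might look like this:

```xml
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <!-- Number of tasks to run per task JVM; -1 means no limit -->
  <value>-1</value>
</property>
```

The same setting is also exposed programmatically on the old-API JobConf via setNumTasksToExecutePerJvm(-1).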

As far as I know, Hadoop is based on a shared-nothing architecture, so your private Set variable won't get shared among different mappers. So there shouldn't be any question of getting, as you mentioned, "junk from previous mappers".

