
Why do we need the setup() method in MapReduce when we can initialize parameters in map() or reduce()?

I am new to Hadoop and the overall MapReduce paradigm. I searched a lot on the web about overriding the setup() method in the Map class to access the configuration object. But from what I read, it seems that the setup() method is anyways called every time a task is run.

So why is there a need for a separate method to access the configuration object and initialize parameters? Why can't we do the same directly in the map() or reduce() methods?

Though both approaches will give the required output in the end, is there a performance factor that comes into play when choosing one approach over the other? Thanks in advance.

In my opinion, the answer lies not in Hadoop but in the programming paradigm. It is always good to separate the different parts of the business logic, and setting up the running environment is different from running the map itself.

Imagine a scenario where you have certain data on which you wish to do multiple calculations. If you have a parent class for your jobs, it is better to do the common setup phases there, by overriding a separate method.

The design just encourages this behaviour, which you would choose anyway.
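To make the parent-class idea concrete, here is a minimal sketch in plain Java. It deliberately does not use the Hadoop classes, and every name in it (BaseTask, SumTask, MaxTask, the "field.separator" key) is made up for illustration. It only shows the shape: the shared setup lives in one overridable method in the parent, and each job supplies nothing but its own per-record logic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Entry point for the sketch. Every class name here is hypothetical.
class SharedSetupDemo {
    public static void main(String[] args) {
        // Stand-in for the job configuration; the key name is made up.
        Map<String, String> conf = new HashMap<>();
        conf.put("field.separator", ";");

        BaseTask sum = new SumTask();
        BaseTask max = new MaxTask();
        sum.setup(conf); // called once, before any record is processed
        max.setup(conf);

        for (String record : new String[] {"3;9;4", "10;2"}) {
            System.out.println(record + " sum=" + sum.map(record) + " max=" + max.map(record));
        }
    }
}

// Parent class: owns the setup that all calculations share.
abstract class BaseTask {
    private String separator;

    // Reads the shared parameters once and keeps them as fields.
    void setup(Map<String, String> conf) {
        separator = conf.getOrDefault("field.separator", ",");
    }

    // Per-record logic; this is the only thing a subclass has to write.
    abstract long map(String record);

    // Shared helper that relies on the state prepared in setup().
    protected long[] parse(String record) {
        String[] parts = record.split(Pattern.quote(separator));
        long[] values = new long[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Long.parseLong(parts[i].trim());
        }
        return values;
    }
}

class SumTask extends BaseTask {
    @Override
    long map(String record) {
        long total = 0;
        for (long v : parse(record)) {
            total += v;
        }
        return total;
    }
}

class MaxTask extends BaseTask {
    @Override
    long map(String record) {
        long best = Long.MIN_VALUE;
        for (long v : parse(record)) {
            best = Math.max(best, v);
        }
        return best;
    }
}
```

Neither subclass knows where the separator comes from. If the way it is configured changes, only BaseTask.setup() has to change.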

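Otherwise you would have to check inside map() or reduce() whether the parameters have already been initialized. Splitting initialization and the actual mapping logic into separate phases keeps the initialization simple.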

I'm not sure if I'm right, but as far as I understand, map() and reduce() are executed on nodes in a distributed network, where the nodes have no knowledge of the whole system. So what you have access to inside the map() and reduce() methods is not what is configured on the main node. You can't simply access the whole configuration from a node, because that would mean staying connected to the main node the whole time.

Re: "it seems that the setup() method is anyways called every time a task is run."

Whenever a task is run, a number of records are processed by the corresponding Map or Reduce task. The map() method is called for every record being processed, and reduce() for every key with its grouped values. However, the setup() method is run once per task, giving you the opportunity to optimize the workflow by initializing configuration and resources (such as a database connection or a reference file) only once for all the records processed by that task.

Similarly, the API provides a callback named "cleanup" where you can release those resources. It is invoked when the task has finished processing the records allocated to it.
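The call order described above can be sketched in plain Java. This is a simulation, not Hadoop code: CountingTask and its fields are hypothetical names, and the list stands in for an expensive resource such as a database connection. The run() method here mimics the order in which a task drives the callbacks: setup once, map for each record, then cleanup once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Entry point for the sketch. All names are hypothetical.
class LifecycleDemo {
    public static void main(String[] args) {
        CountingTask task = new CountingTask();
        task.run(Arrays.asList("a", "b", "c"));
        System.out.println("setup=" + task.setupCalls
                + " map=" + task.mapCalls
                + " cleanup=" + task.cleanupCalls);
    }
}

class CountingTask {
    int setupCalls;
    int mapCalls;
    int cleanupCalls;

    // Stand-in for something costly to create, e.g. a connection or a lookup table.
    private List<String> resource;

    // Once per task: acquire the resource.
    void setup() {
        setupCalls++;
        resource = new ArrayList<>();
    }

    // Once per record: only use what setup() prepared.
    void map(String record) {
        mapCalls++;
        resource.add(record.toUpperCase());
    }

    // Once per task: release the resource.
    void cleanup() {
        cleanupCalls++;
        resource = null;
    }

    // Simulated task driver: setup, then map per record, then cleanup.
    void run(List<String> records) {
        setup();
        try {
            for (String record : records) {
                map(record);
            }
        } finally {
            cleanup(); // runs even if a map() call throws
        }
    }
}
```

With three records this prints setup=1 map=3 cleanup=1. Had the resource been created inside map(), it would have been created three times, which is where the performance difference asked about in the question comes from.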
