With Hadoop, how to change the number of mappers for a given job?
So, I have two jobs, Job A and Job B. For Job A, I would like to have a maximum of 6 mappers per node. However, Job B is a little different. For Job B, I can only run one mapper per node. The reason for this isn't important -- let's just say this requirement is non-negotiable. I would like to tell Hadoop, "For Job A, schedule a maximum of 6 mappers per node. But for Job B, schedule a maximum of 1 mapper per node." Is this possible at all?
The only solution I can think of is:
1) Have two folders off the main hadoop folder, conf.JobA and conf.JobB. Each folder has its own copy of mapred-site.xml. conf.JobA/mapred-site.xml has a value of 6 for mapred.tasktracker.map.tasks.maximum. conf.JobB/mapred-site.xml has a value of 1 for mapred.tasktracker.map.tasks.maximum.
2) Before I run Job A:
2a) Shut down my tasktrackers
2b) Copy conf.JobA/mapred-site.xml into Hadoop's conf folder, replacing the mapred-site.xml that was already in there
2c) Restart my tasktrackers
2d) Wait for the tasktrackers to finish starting
3) Run Job A
and then do a similar thing when I need to run Job B.
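For illustration only, step 2b of the procedure above could be sketched in Java as follows. The class name ConfSwitcher and its activate method are hypothetical and not part of Hadoop. The sketch assumes the folder layout described in step 1, and it does not stop or restart the tasktrackers, so steps 2a, 2c and 2d still have to happen around it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper illustrating step 2b; not part of Hadoop. */
final class ConfSwitcher {
    private ConfSwitcher() {}

    /**
     * Copies <hadoopHome>/conf.<jobName>/mapred-site.xml over
     * <hadoopHome>/conf/mapred-site.xml and returns the path of the replaced file.
     * The tasktrackers are assumed to be stopped before this call and
     * restarted after it.
     */
    static Path activate(Path hadoopHome, String jobName) throws IOException {
        Path source = hadoopHome.resolve("conf." + jobName).resolve("mapred-site.xml");
        Path target = hadoopHome.resolve("conf").resolve("mapred-site.xml");
        if (!Files.isRegularFile(source)) {
            throw new IOException("Missing per-job config: " + source);
        }
        // Make sure the live conf folder exists, then overwrite the active file.
        Files.createDirectories(target.getParent());
        return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

A call such as ConfSwitcher.activate(hadoopHome, "JobA") would then be made between shutting the tasktrackers down and starting them again, with "JobB" used before running Job B.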
I really don't like this solution; it seems kludgey and failure-prone. Is there a better way to do what I need to do?
In the Java code for your custom jar, you could set the mapred.tasktracker.map.tasks.maximum property for both of your jobs.
Do something like this:
Configuration conf = getConf();
// set the maximum number of map tasks run concurrently per tasktracker
conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
Job job = new Job(conf);
job.setJarByClass(MyMapRed.class);
job.setJobName(JOB_NAME);
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(MapJob.class);
job.setMapOutputKeyClass(Text.class);
job.setReducerClass(ReduceJob.class);
job.setMapOutputValueClass(Text.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, args[0]);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
EDIT:
You also need to set the property mapred.map.tasks to the value derived from the following formula: (mapred.tasktracker.map.tasks.maximum * number of tasktracker nodes in your cluster).
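As a worked example of that formula, here is a minimal sketch. The class MapTaskHint is hypothetical and not part of Hadoop, and the 10-node cluster size used in the note below is an assumption for illustration. Keep in mind that Hadoop treats mapred.map.tasks as a hint, since the actual number of map tasks is driven by the input splits.

```java
/** Hypothetical helper for the formula above; not part of Hadoop. */
final class MapTaskHint {
    private MapTaskHint() {}

    /**
     * Returns perNodeMax * nodeCount, i.e. the value suggested for
     * mapred.map.tasks. Rejects non-positive inputs; Math.multiplyExact
     * throws ArithmeticException on int overflow.
     */
    static int totalMapTasks(int perNodeMax, int nodeCount) {
        if (perNodeMax <= 0 || nodeCount <= 0) {
            throw new IllegalArgumentException("Both values must be positive");
        }
        return Math.multiplyExact(perNodeMax, nodeCount);
    }
}
```

On an assumed 10-node cluster this gives 6 * 10 = 60 for Job A and 1 * 10 = 10 for Job B. The result would be passed to conf.setInt("mapred.map.tasks", ...) next to the per-node setting in the job setup code.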