
Setting the Number of Reducers in a MapReduce Job in an Oozie Workflow

I have a five-node cluster, three nodes of which run DataNodes and TaskTrackers.

I've imported around 10 million rows from Oracle via Sqoop and process them with a MapReduce job in an Oozie workflow.

The MapReduce job takes about 30 minutes and uses only one reducer.

Edit - If I run the MapReduce code on its own, separate from Oozie, job.setNumReduceTasks(4) correctly sets up four reducers.

I have tried the following methods to manually set the number of reducers to four, with no success:

In Oozie, set the following property in the <configuration> tag of the map-reduce action node:

<property><name>mapred.reduce.tasks</name><value>4</value></property>
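
For reference, that property would typically sit inside the action's <configuration> element; a minimal sketch of such an action is shown below (the action name and the ${jobTracker}/${nameNode} parameters are placeholders, and the mapper/reducer class and path properties are omitted):

<action name="mr-10-million-rows">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- mapper/reducer classes and input/output paths omitted -->
            <property>
                <name>mapred.reduce.tasks</name>
                <value>4</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>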

In the MapReduce Java code's main method:

Configuration conf = new Configuration();
Job job = new Job(conf, "10 million rows");
...
job.setNumReduceTasks(4);

I also tried:

Configuration conf = new Configuration();
Job job = new Job(conf, "10 million rows");
...
conf.set("mapred.reduce.tasks", "4");

My map function looks similar to this:

public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
    CustomObj customObj = new CustomObj(key.toString());
    context.write(new Text(customObj.getId()), customObj);
}

I think there are something like 80,000 distinct values for the ID.

My reduce function looks similar to this:

public void reduce(Text key, Iterable<CustomObj> vals, Context context) throws IOException, InterruptedException {
    OtherCustomObj otherCustomObj = new OtherCustomObj();
    ...
    context.write(null, otherCustomObj);
}

The custom object emitted by the Mapper implements WritableComparable, but the other custom object emitted by the Reducer does not implement WritableComparable.
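
For context, a value class along those lines might look roughly like the sketch below (the fields and parsing logic are hypothetical, since the real CustomObj is not shown in the question):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical sketch of a WritableComparable value class such as CustomObj.
// It must be able to serialize/deserialize itself and define an ordering.
public class CustomObj implements WritableComparable<CustomObj> {
    private String id;
    private long amount;   // example field, not taken from the question

    public CustomObj() {}  // Hadoop requires a no-arg constructor

    public CustomObj(String raw) {
        // parse the raw record into fields (details omitted)
        this.id = raw;
    }

    public String getId() {
        return id;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeLong(amount);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readUTF();
        amount = in.readLong();
    }

    @Override
    public int compareTo(CustomObj other) {
        return id.compareTo(other.id);
    }
}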

Here are the logs for the file system counters, job counters, and the Map-Reduce framework, which show that only one reduce task was launched:

 map 100% reduce 100%
 Job complete: job_201401131546_0425
 Counters: 32
   File System Counters
     FILE: Number of bytes read=1370377216
     FILE: Number of bytes written=2057213222
     FILE: Number of read operations=0
     FILE: Number of large read operations=0
     FILE: Number of write operations=0
     HDFS: Number of bytes read=556345690
     HDFS: Number of bytes written=166938092
     HDFS: Number of read operations=18
     HDFS: Number of large read operations=0
     HDFS: Number of write operations=1
   Job Counters 
     Launched map tasks=11
     Launched reduce tasks=1
     Data-local map tasks=11
     Total time spent by all maps in occupied slots (ms)=1268296
     Total time spent by all reduces in occupied slots (ms)=709774
     Total time spent by all maps waiting after reserving slots (ms)=0
     Total time spent by all reduces waiting after reserving slots (ms)=0
   Map-Reduce Framework
     Map input records=9440000
     Map output records=9440000
     Map output bytes=666308476
     Input split bytes=1422
     Combine input records=0
     Combine output records=0
     Reduce input groups=80000
     Reduce shuffle bytes=685188530
     Reduce input records=9440000
     Reduce output records=2612760
     Spilled Records=28320000
     CPU time spent (ms)=1849500
     Physical memory (bytes) snapshot=3581157376
     Virtual memory (bytes) snapshot=15008251904
     Total committed heap usage (bytes)=2848063488

Edit: I modified the MapReduce job to introduce a custom partitioner, a sort comparator, and a grouping comparator. For some reason, the code now launches two reducers (when scheduled via Oozie), but still not four.
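
For reference, a minimal sketch of such a partitioner over the Text ID key might look like the following (the class name and the hash-based scheme are assumptions, not the actual code from the question):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: spreads keys across reducers by hashing the ID.
// With roughly 80,000 distinct IDs this should give every reducer work,
// provided the job is actually configured to run more than one reduce task.
public class IdPartitioner extends Partitioner<Text, CustomObj> {
    @Override
    public int getPartition(Text key, CustomObj value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered on the job with job.setPartitionerClass(IdPartitioner.class).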

I set the mapred.tasktracker.map.tasks.maximum property to 20 on each TaskTracker (and on the JobTracker) and restarted them, but with no result.

Just as a starting point, what is the value of the following property in your mapred-site.xml?

<property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
</property>
