
During an HBase scan with MapReduce, the number of Reducers is always one

I do an HBase scan in the Mapper, then the Reducer writes the result to HDFS.
The number of records output by the Mapper is roughly 1,000,000,000.

The problem is that the number of reducers is always one, even though I have set -Dmapred.reduce.tasks=100. The reduce phase is therefore very slow.

// edit at 2016-12-04 by 祝方泽
The code of my main class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class GetUrlNotSent2SpiderFromHbase extends Configured implements Tool {

    public int run(String[] arg0) throws Exception {

        Configuration conf = getConf();
        Job job = new Job(conf, conf.get("mapred.job.name"));
        String input_table = conf.get("input.table");

        job.setJarByClass(GetUrlNotSent2SpiderFromHbase.class);

        // Scan only the two columns that are needed and skip the block cache
        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false);
        scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("sitemap_type"));
        scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("is_send_to_spider"));

        // Wires the HBase table in as the job input and registers the mapper
        // together with its output key/value types
        TableMapReduceUtil.initTableMapperJob(
                input_table,
                scan,
                GetUrlNotSent2SpiderFromHbaseMapper.class,
                Text.class,
                Text.class,
                job);

        /*job.setMapperClass(GetUrlNotSent2SpiderFromHbaseMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);*/

        job.setReducerClass(GetUrlNotSent2SpiderFromHbaseReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        if (job.waitForCompletion(true) && job.isSuccessful()) {
            return 0;
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        int res = ToolRunner.run(conf, new GetUrlNotSent2SpiderFromHbase(), args);
        System.exit(res);
    }
}

Here is the script that runs this MapReduce job:

table="xxx"
output="yyy"
sitemap_type="zzz"

JOBCONF=""
JOBCONF="${JOBCONF} -Dmapred.job.name=test_for_scan_hbase"
JOBCONF="${JOBCONF} -Dinput.table=$table"
JOBCONF="${JOBCONF} -Dmapred.output.dir=$output"
JOBCONF="${JOBCONF} -Ddemand.sitemap.type=$sitemap_type"
JOBCONF="${JOBCONF} -Dyarn.app.mapreduce.am.command-opts='-Xmx8192m'"
JOBCONF="${JOBCONF} -Dyarn.app.mapreduce.am.resource.mb=9216"
JOBCONF="${JOBCONF} -Dmapreduce.map.java.opts='-Xmx1536m'"
JOBCONF="${JOBCONF} -Dmapreduce.map.memory.mb=2048"
JOBCONF="${JOBCONF} -Dmapreduce.reduce.java.opts='-Xmx1536m'"
JOBCONF="${JOBCONF} -Dmapreduce.reduce.memory.mb=2048"
JOBCONF="${JOBCONF} -Dmapred.reduce.tasks=100"
JOBCONF="${JOBCONF} -Dmapred.job.priority=VERY_HIGH"

hadoop fs -rmr $output
hadoop jar get_url_not_sent_2_spider_from_hbase_hourly.jar hourly.GetUrlNotSent2SpiderFromHbase $JOBCONF
echo "===== scan HBase finished ====="

I set job.setNumReduceTasks(100); in the code, and it worked.
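
For reference, a minimal sketch of where that call sits in the run() method shown above (everything else unchanged):

    job.setReducerClass(GetUrlNotSent2SpiderFromHbaseReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // Fix the number of reduce tasks explicitly in the driver;
    // this overrides whatever default the cluster would otherwise use.
    job.setNumReduceTasks(100);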

Since you mentioned that only one reducer is running, that is the obvious reason why the reduce phase is very slow.

A uniform way to see which configuration properties are actually applied to a Job (call this for every job you execute to check that the parameters were passed correctly):

Add the method below to the job driver shown above to print the configuration entries applied from all possible sources, i.e. from -D or anywhere else. Call this method in the driver program before the job is submitted:

public static void printConfigApplied(Configuration conf) {
    try {
        // Dump the fully resolved configuration as XML to stdout
        conf.writeXml(System.out);
    } catch (final IOException e) {
        e.printStackTrace();
    }
}
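
For example, a sketch of where that call could go in the run() method shown earlier, just before the job is submitted (the rest of the driver is assumed unchanged):

    // Print the configuration the Job will actually run with, so you can
    // check whether mapred.reduce.tasks=100 really made it in.
    printConfigApplied(job.getConfiguration());

    if (job.waitForCompletion(true) && job.isSuccessful()) {
        return 0;
    }
    return -1;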

This proves that your system properties (i.e. -Dxxx) are not being applied from the command line, so the way you are passing them is not correct, whereas setting the value programmatically works.

Since job.setNumReduceTasks() works, I strongly suspect the code below, where your system properties are not being passed correctly to the driver:

    Configuration conf = getConf();
    Job job = new Job(conf, conf.get("mapred.job.name"));

Change this to follow the linked example; a rough sketch is given below.
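
Purely as an illustration, a commonly used driver pattern that keeps the ToolRunner-parsed configuration intact looks roughly like this (a sketch under that assumption, not necessarily the exact example that was linked):

    public int run(String[] args) throws Exception {
        // getConf() returns the Configuration that ToolRunner/GenericOptionsParser
        // has already populated with the -D options from the command line.
        Configuration conf = getConf();

        // Job.getInstance(conf, name) is the non-deprecated replacement for
        // new Job(conf, name); it copies conf, so -D values such as
        // mapred.reduce.tasks are carried into the job.
        Job job = Job.getInstance(conf, conf.get("mapred.job.name"));
        job.setJarByClass(GetUrlNotSent2SpiderFromHbase.class);

        // ... the rest of the job setup stays the same as in the driver above ...

        return job.waitForCompletion(true) ? 0 : -1;
    }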
