HBase mapreduce job - Multiple scans - How to set the table of each Scan
I am using HBase 1.2, and I want to run a MapReduce job over HBase using multiple scans. The API provides:

TableMapReduceUtil.initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job)

But how do I specify the table for each scan? I am using the code below:
List<Scan> scans = new ArrayList<>();
for (String firstPart : firstParts) {
    Scan scan = new Scan();
    scan.setRowPrefixFilter(Bytes.toBytes(firstPart));
    scan.setCaching(500);
    scan.setCacheBlocks(false);
    scans.add(scan);
}
TableMapReduceUtil.initTableMapperJob(scans, MyMapper.class, Text.class, Text.class, job);
It throws the following exception:
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:436)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.initialize(TableInputFormat.java:184)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:241)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:240)
at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:305)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:322)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1307)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1304)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1714)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1304)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1325)
I think this is expected, since the table each scan should run against is never specified anywhere.
But how do I specify it?
I tried adding
scan.setAttribute("scan.attributes.table.name", Bytes.toBytes("my_table"));
but it gives the same error.
List<Scan> scans = new ArrayList<Scan>();

Scan scan1 = new Scan();
scan1.setStartRow(firstRow1);
scan1.setStopRow(lastRow1);
scan1.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, table1);
scans.add(scan1);

Scan scan2 = new Scan();
scan2.setStartRow(firstRow2);
scan2.setStopRow(lastRow2);
scan2.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, table2);
scans.add(scan2);

TableMapReduceUtil.initTableMapperJob(scans, TableMapper.class, Text.class, IntWritable.class, job);
This uses Scan.SCAN_ATTRIBUTES_TABLE_NAME to bind each scan to its table.
You are getting this NPE because the table is not set at the Scan-instance level.
Follow the example below, where the table name is set inside the for loop rather than outside of it; then it should work:
List<Scan> scans = new ArrayList<Scan>();
for (int i = 0; i < 3; i++) {
    Scan scan = new Scan();
    scan.addFamily(INPUT_FAMILY);
    scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes(TABLE_NAME));
    if (start != null) {
        scan.setStartRow(Bytes.toBytes(start));
    }
    if (stop != null) {
        scan.setStopRow(Bytes.toBytes(stop));
    }
    scans.add(scan);
    LOG.info("scan before: " + scan);
}
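Putting the two pieces together, here is a minimal sketch of how the original question's job setup could be fixed. This is an illustrative assumption, not verified code: MultiScanJob, the configure method, and the single tableName parameter are hypothetical names, and MyMapper stands in for the question's (unshown) mapper class. Only the HBase 1.2 APIs already used above appear here.

```java
// Sketch: build one Scan per row prefix, tag each scan with its target
// table via Scan.SCAN_ATTRIBUTES_TABLE_NAME, then register them all as a
// single multi-scan MapReduce job. Requires a running HBase/Hadoop setup.
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class MultiScanJob {

    // firstParts: the row prefixes from the question; tableName: the table
    // each of these scans should read (a different name could be set per scan).
    public static void configure(Job job, List<String> firstParts, String tableName)
            throws Exception {
        List<Scan> scans = new ArrayList<>();
        for (String firstPart : firstParts) {
            Scan scan = new Scan();
            scan.setRowPrefixFilter(Bytes.toBytes(firstPart));
            scan.setCaching(500);
            scan.setCacheBlocks(false);
            // The crucial line: tells the multi-table input format which
            // table this particular scan reads, avoiding the NPE above.
            scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME,
                    Bytes.toBytes(tableName));
            scans.add(scan);
        }
        // MyMapper is the question's TableMapper subclass (not shown here).
        TableMapReduceUtil.initTableMapperJob(
                scans, MyMapper.class, Text.class, Text.class, job);
    }
}
```

Because each Scan carries its own table-name attribute, scans over different tables can be mixed in the same list and the same job.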