
HBase MapReduce job - Multiple scans - How to set the table of each Scan

I use HBase 1.2. I would like to run a MapReduce job on HBase using multiple scans. The API provides: TableMapReduceUtil.initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job).

But how do I specify the table for each scan? I use the code below:

List<Scan> scans = new ArrayList<>();
for (String firstPart : firstParts) {
    Scan scan = new Scan();
    scan.setRowPrefixFilter(Bytes.toBytes(firstPart));
    scan.setCaching(500);
    scan.setCacheBlocks(false);
    scans.add(scan);
}
TableMapReduceUtil.initTableMapperJob(scans, MyMapper.class, Text.class, Text.class, job);

It throws the following exception:

Exception in thread "main" java.lang.NullPointerException
        at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:436)
        at org.apache.hadoop.hbase.mapreduce.TableInputFormat.initialize(TableInputFormat.java:184)
        at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:241)
        at org.apache.hadoop.hbase.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:240)
        at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:115)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:305)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:322)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1307)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1304)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1714)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1304)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1325)

I think this is expected, since the table that each scan should be applied to is not specified anywhere.

But how do I do it?

I tried to add

scan.setAttribute("scan.attributes.table.name", Bytes.toBytes("my_table"));

but it gives the same error.

From the docs: https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableInputFormat.html

List<Scan> scans = new ArrayList<Scan>();

Scan scan1 = new Scan();
scan1.setStartRow(firstRow1);
scan1.setStopRow(lastRow1);
scan1.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, table1);
scans.add(scan1);

Scan scan2 = new Scan();
scan2.setStartRow(firstRow2);
scan2.setStopRow(lastRow2);
scan2.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, table2);
scans.add(scan2);

TableMapReduceUtil.initTableMapperJob(scans, TableMapper.class, Text.class,
    IntWritable.class, job);
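
Note that when initTableMapperJob is given a List<Scan>, it configures the job to use MultiTableInputFormat, and that input format reads the target table of each scan from exactly this attribute; so every Scan in the list needs it set before the job is submitted.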

Your case:

Use Scan.SCAN_ATTRIBUTES_TABLE_NAME. Since you are not setting the table at the Scan instance level, you are getting this NPE.

Please follow this example, where the table name has to be set inside your for loop, not outside; then it should work:

List<Scan> scans = new ArrayList<Scan>();

for (int i = 0; i < 3; i++) {
    Scan scan = new Scan();

    scan.addFamily(INPUT_FAMILY);
    // the important part: attach the target table name to the Scan itself
    scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes(TABLE_NAME));

    if (start != null) {
        scan.setStartRow(Bytes.toBytes(start));
    }
    if (stop != null) {
        scan.setStopRow(Bytes.toBytes(stop));
    }

    scans.add(scan);

    LOG.info("scan before: " + scan);
}
