
HBase multiple table scans for the job

I am looking at the following scenario. I have a data file sent daily, and I load it into HBase as a table named in the file-yyyyMMdd format. So over a period of time I have many tables, e.g.:

tempdb-20121220
tempdb-20121221
tempdb-20121222
tempdb-20121223
tempdb-20121224
tempdb-20121225

Now, for a specific date range, I want to get the list of tables matching that range so that I can create indexes. I am using hbase-0.90.6.

As far as my research goes, TableMapReduceUtil.initTableMapperJob takes only one table name:

TableMapReduceUtil.initTableMapperJob(
        tableName,                // input HBase table name
        scan,                     // Scan instance to control CF and attribute selection
        HBaseIndexerMapper.class, // mapper
        null,                     // mapper output key
        null,                     // mapper output value
        job);

I have been able to get the list of tables and run the job in a loop over them, but the idea is to scan all the tables in one go (or something similar) so that ultimately I get the merged/combined results for indexing purposes.
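For reference, a minimal sketch of how the matching table names can be gathered with the 0.90.x client API (the tempdb- prefix and the plain-string date comparison are assumptions for illustration):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class TableRangeLister {

    // Returns the names of tables of the form <prefix><yyyyMMdd> whose date part
    // falls inside [fromDate, toDate]; yyyyMMdd strings compare correctly as plain strings.
    public static List<String> tablesInRange(Configuration conf, String prefix,
            String fromDate, String toDate) throws IOException {
        List<String> matches = new ArrayList<String>();
        HBaseAdmin admin = new HBaseAdmin(conf);
        for (HTableDescriptor desc : admin.listTables()) {
            String name = desc.getNameAsString();
            if (!name.startsWith(prefix)) {
                continue;
            }
            String date = name.substring(prefix.length());
            if (date.compareTo(fromDate) >= 0 && date.compareTo(toDate) <= 0) {
                matches.add(name);
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        System.out.println(tablesInRange(conf, "tempdb-", "20121220", "20121225"));
    }
}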

Any direction to achieve this would be great and helpful.

OK, please check the HBase 0.94.6 sources (they look closest to your version). There you will find the MultiTableInputFormat class (follow the link to see the JavaDoc, including an example), which does what you need. Just a few days ago I had the experience of adding this class to an HBase 0.94.2 (actually CDH 4.2.1) based project. Successfully.

This seems to do exactly what you need, and in a very efficient way. The only issue here is that you will have one mapper processing all the data. To distinguish the tables you will probably need to take the TableSplit class from 0.94.6, rename it slightly, and port it so it does not break your environment. Also check the differences in TableMapReduceUtil - you will need to configure your Scans manually so the input format understands their configuration.
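For illustration, a rough sketch of what the mapper side can look like once a 0.94.x-style TableSplit is on the classpath (the mapper name and the key/value types here are assumptions, not part of the question):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Illustrative mapper: prefixes each row key with the name of the table it came
// from, so downstream stages can still tell the sources apart after the merge.
public class TableAwareMapper extends TableMapper<Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        // With MultiTableInputFormat each split is a TableSplit, so the source
        // table name is available inside the mapper.
        TableSplit split = (TableSplit) context.getInputSplit();
        String table = Bytes.toString(split.getTableName());

        context.write(new Text(table + ":" + Bytes.toString(row.get())), ONE);
    }
}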

Also consider simply moving to HBase 0.94.6 - a much easier path, but one I was not able to follow myself. It took me about 12 working hours to understand the issues here, investigate solutions, understand my problem with CDH 4.2.1, and port everything. The good news for me is that Cloudera intends to move to 0.94.6 in CDH 4.3.0.

UPDATE1: CDH 4.3.0 is available and it includes HBase 0.94.6 with all required infrastructure.

UPDATE2: I moved to another solution - a custom input format which combines several HBase tables, intermixing their rows by key. It turned out to be very useful, especially with proper key design: you get whole aggregates in a single mapper. I'm considering posting this code on GitHub.

Passing a List<Scan> is also a way to do this. I agree with using MultiTableInputFormat as well:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TestMultiScan extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        List<Scan> scans = new ArrayList<Scan>();

        // Each Scan carries the table it should run against via the
        // "scan.attributes.table.name" attribute (Scan.SCAN_ATTRIBUTES_TABLE_NAME).
        Scan scan1 = new Scan();
        scan1.setAttribute("scan.attributes.table.name", Bytes.toBytes("table1ddmmyyyy"));
        System.out.println(Bytes.toString(scan1.getAttribute("scan.attributes.table.name")));
        scans.add(scan1);

        Scan scan2 = new Scan();
        scan2.setAttribute("scan.attributes.table.name", Bytes.toBytes("table2ddmmyyyy"));
        System.out.println(Bytes.toString(scan2.getAttribute("scan.attributes.table.name")));
        scans.add(scan2);

        // Pick up hbase-site.xml settings on top of whatever ToolRunner passed in.
        Configuration conf = HBaseConfiguration.create(getConf());
        Job job = new Job(conf);
        job.setJarByClass(TestMultiScan.class);

        // The List<Scan> overload wires up MultiTableInputFormat under the hood.
        TableMapReduceUtil.initTableMapperJob(
                scans,                   // one Scan per input table
                MultiTableMapper.class,  // mapper
                Text.class,              // mapper output key
                IntWritable.class,       // mapper output value
                job);
        TableMapReduceUtil.initTableReducerJob(
                "xxxxx",                 // output table
                MultiTableReducer.class, // reducer
                job);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new TestMultiScan(), args));
    }
}
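For completeness, a minimal sketch of what the MultiTableReducer referenced above could look like (the column family, qualifier, and the simple count aggregation are placeholders for illustration):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Illustrative reducer: sums the counts emitted by the mapper for each key
// (which may have been seen in several input tables) and writes the combined
// result to the output table configured via initTableReducerJob.
public class MultiTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable v : values) {
            total += v.get();
        }
        // "cf" and "count" are placeholder column family / qualifier names.
        Put put = new Put(Bytes.toBytes(key.toString()));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(total));
        context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
}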

This is how we solved our multi-tenancy requirements with HBase namespaced tables, for example DEV1:TABLEX (data ingested by DEV1) and UAT1:TABLEX (data consumed by UAT1): in the mapper we want to compare both namespaced tables before proceeding further.

Internally it uses MultiTableInputFormat, as can be seen in TableMapReduceUtil.java.
