
HBase multiple table scans for the job

I am looking at the following scenario. I have a data file sent daily, and I add it into HBase as a table named in the file-yyyyMMdd format. So over a period of time I have many tables, e.g.

tempdb-20121220
tempdb-20121221
tempdb-20121222
tempdb-20121223
tempdb-20121224
tempdb-20121225

Now what I want to do is, for a specific date range, get the list of tables matching that range so that I can create indexes. I am using hbase-0.90.6.
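One way to select the matching tables is to compare the yyyyMMdd suffix of each name lexicographically, which works because the format sorts chronologically. A minimal sketch (the class and helper names here are hypothetical; in a real job the candidate names would come from HBaseAdmin.listTables() rather than a hard-coded list):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: filter table names of the form tempdb-yyyyMMdd by date range.
public class TableRangeFilter {

    static List<String> tablesInRange(List<String> names, String start, String end) {
        List<String> matched = new ArrayList<String>();
        for (String name : names) {
            int dash = name.lastIndexOf('-');
            if (dash < 0) continue;                  // not a dated table, skip
            String date = name.substring(dash + 1);  // the yyyyMMdd suffix
            // yyyyMMdd strings compare correctly as plain strings
            if (date.compareTo(start) >= 0 && date.compareTo(end) <= 0) {
                matched.add(name);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "tempdb-20121220", "tempdb-20121223", "tempdb-20121226");
        System.out.println(tablesInRange(names, "20121221", "20121225"));
    }
}
```

The matched names can then feed whatever per-table or multi-table scan setup follows.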

As far as my research goes, TableMapReduceUtil.initTableMapperJob takes only one table name.

TableMapReduceUtil.initTableMapperJob(
    tableName,                  // input HBase table name
    scan,                       // Scan instance to control CF and attribute selection
    HBaseIndexerMapper.class,   // mapper
    null,                       // mapper output key
    null,                       // mapper output value
    job
);

I have been able to get the list of tables and run the job in a loop, but the idea is to go through all the tables, scan them (or something similar), so that ultimately I get the merged/combined results for indexing purposes.

Any direction on how to achieve this would be great and helpful.

OK, please check the HBase 0.94.6 sources (they look closest to your version). There you will find the MultiTableInputFormat class (follow the link to see the JavaDoc, including an example), which does what you need. Just a few days ago I added this class to an HBase 0.94.2 (actually CDH 4.2.1) based project. It was successful.

This seems to do exactly what you need, and in a very efficient way. The only issue is that you will have one mapper processing all the data. To distinguish tables you will probably need to take the TableSplit class from 0.94.6, rename it slightly, and port it so it does not break your environment. And please check the differences in TableMapReduceUtil: you will need to configure your Scans manually so the input format understands their configuration.

Also consider simply moving to HBase 0.94.6 - a much easier path, though I was not able to follow it myself. It took me about 12 working hours to understand the issues here, investigate solutions, understand my problem with CDH 4.2.1, and port everything. The good news for me is that Cloudera intends to move to 0.94.6 in CDH 4.3.0.

UPDATE 1: CDH 4.3.0 is available, and it includes HBase 0.94.6 with all the required infrastructure.

UPDATE 2: I moved to another solution - a custom input format which combines several HBase tables, intermixing their rows by key. It turned out to be very useful, especially with proper key design: you get whole aggregates in a single mapper. I'm considering posting this code on GitHub.
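The core idea behind such a combining input format can be sketched in plain Java: each table yields rows sorted by key, and a min-heap interleaves the per-table iterators so the consumer (the mapper) sees all rows sharing a key together, regardless of source table. The sketch below models rows as bare string keys rather than HBase Results, and all names in it are hypothetical:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// K-way merge of sorted row streams, as a combining input format would do.
public class RowMerge {

    // One pending row key plus the iterator it came from.
    static final class Entry {
        final String key;
        final Iterator<String> rest;
        Entry(String key, Iterator<String> rest) { this.key = key; this.rest = rest; }
    }

    static List<String> mergeSortedTables(List<List<String>> tables) {
        PriorityQueue<Entry> heap = new PriorityQueue<Entry>(
                Math.max(1, tables.size()),
                new Comparator<Entry>() {
                    public int compare(Entry a, Entry b) { return a.key.compareTo(b.key); }
                });
        // Seed the heap with the first row of every non-empty table.
        for (List<String> table : tables) {
            Iterator<String> it = table.iterator();
            if (it.hasNext()) heap.add(new Entry(it.next(), it));
        }
        List<String> merged = new ArrayList<String>();
        while (!heap.isEmpty()) {
            Entry e = heap.poll();          // smallest pending key across tables
            merged.add(e.key);
            if (e.rest.hasNext()) heap.add(new Entry(e.rest.next(), e.rest));
        }
        return merged;
    }
}
```

With a shared key design across the daily tables, equal keys come out adjacent, so one mapper call can see the whole aggregate for a key.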

List<Scan> is also a way. I agree with MultiTableInputFormat as well:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;

public class TestMultiScan extends Configured implements Tool {

    @Override
    public int run(String[] arg0) throws Exception {
        List<Scan> scans = new ArrayList<Scan>();

        Scan scan1 = new Scan();
        // Tell MultiTableInputFormat which table this Scan targets
        // (the attribute key is Scan.SCAN_ATTRIBUTES_TABLE_NAME).
        scan1.setAttribute("scan.attributes.table.name", Bytes.toBytes("table1ddmmyyyy"));
        scans.add(scan1);

        Scan scan2 = new Scan();
        scan2.setAttribute("scan.attributes.table.name", Bytes.toBytes("table2ddmmyyyy"));
        scans.add(scan2);

        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(TestMultiScan.class);

        // One job over the whole list of Scans, one Scan per input table.
        TableMapReduceUtil.initTableMapperJob(
                scans,                   // List<Scan>
                MultiTableMapper.class,  // user-defined TableMapper
                Text.class,              // mapper output key
                IntWritable.class,       // mapper output value
                job);
        TableMapReduceUtil.initTableReducerJob(
                "xxxxx",                 // output table
                MultiTableReducer.class, // user-defined TableReducer
                job);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        TestMultiScan runJob = new TestMultiScan();
        runJob.run(args);
    }
}

In this way we solved our multi-tenancy requirements with HBase namespaced tables. For example: DEV1:TABLEX (data ingested by DEV1) and UAT1:TABLEX (data consumed by UAT1). In the mapper we want to compare both namespaced tables to proceed further.

Internally it uses MultiTableInputFormat, as shown in TableMapReduceUtil.java.
