HBase multiple table scans for one job

Question

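I am looking at the following scenario. I receive a data file every day and load it into HBase as a table named in the format file-yyyyMMdd. Over time I therefore end up with many tables, for example: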

tempdb-20121220
tempdb-20121221
tempdb-20121222
tempdb-20121223
tempdb-20121224
tempdb-20121225

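What I want to do now is, for a particular date range, get the list of tables that fall within that range so that I can build indexes from them. I am using hbase-0.90.6.

As far as my research goes, TableMapReduceUtil.initTableMapperJob accepts only a single tableName: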
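The date-range selection itself can be done on the table names before any job is configured. The following is a minimal sketch, assuming Java 8 or later and names of the form prefix plus yyyyMMdd, as in the list above. TableRangeSelector and its select method are hypothetical names introduced here for illustration; in a real job the input list would come from the cluster (for example from HBaseAdmin.listTables()).

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: selects tables named prefix + yyyyMMdd whose date lies in [from, to]. */
final class TableRangeSelector {
    // BASIC_ISO_DATE parses dates written as yyyyMMdd, e.g. 20121220.
    private static final DateTimeFormatter FMT = DateTimeFormatter.BASIC_ISO_DATE;

    static List<String> select(List<String> tableNames, String prefix,
                               LocalDate from, LocalDate to) {
        List<String> matched = new ArrayList<>();
        for (String name : tableNames) {
            if (!name.startsWith(prefix)) {
                continue; // unrelated table
            }
            try {
                LocalDate date = LocalDate.parse(name.substring(prefix.length()), FMT);
                // Keep the table when from <= date <= to (both ends inclusive).
                if (!date.isBefore(from) && !date.isAfter(to)) {
                    matched.add(name);
                }
            } catch (DateTimeParseException e) {
                // Suffix is not a date: skip this table.
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "tempdb-20121220", "tempdb-20121221", "tempdb-20121222",
                "tempdb-20121223", "tempdb-20121224", "tempdb-20121225");
        // Prints [tempdb-20121222, tempdb-20121223, tempdb-20121224]
        System.out.println(select(names, "tempdb-",
                LocalDate.of(2012, 12, 22), LocalDate.of(2012, 12, 24)));
    }
}
```

The resulting list can then drive either a loop of single-table jobs or one multi-table job, depending on what the HBase version in use supports.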

TableMapReduceUtil.initTableMapperJob(
tableName,        // input HBase table name
scan,             // Scan instance to control CF and attribute selection
HBaseIndexerMapper.class,   // mapper
null,             // mapper output key
null,             // mapper output value
job
);

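I have been able to get the list of tables and run the job in a loop, but what I really want is to go through all the tables, scan them (or something similar), and end up with a merged/combined result for indexing.

Any pointers on how to achieve this would be very helpful.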

Answer 1

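OK, please check the HBase 0.94.6 sources (they look like the closest to your version). There you will find the MultiTableInputFormat class (see its JavaDoc, which includes an example), which does what you need. Just a few days ago I added this class to a project based on HBase 0.94.2 (actually CDH 4.2.1), and it worked.

It seems to do exactly what you need, and very efficiently. The only issue is that you will have a single mapper processing all the data. To distinguish between tables you will probably need to take the TableSplit class from 0.94.6, rename it slightly, and port it so that it does not break your environment. Also check the differences in TableMapReduceUtil: you will need to configure your scans manually so that the input format understands their configuration.

Also consider simply moving to HBase 0.94.6. That is the much easier route, although I was not able to take it myself. It took me about 12 working hours to understand the issues, investigate solutions, work out my problem with CDH 4.2.1, and port everything. The good news for me is that Cloudera intends to move up to 0.94.6 in CDH 4.3.0.

UPDATE1: CDH 4.3.0 is available, and it includes HBase 0.94.6 with all the required infrastructure.

UPDATE2: I have moved to another solution: a custom input format that combines several HBase tables, intermixing their rows by key. It turned out to be very useful, especially with a proper key design, because you get whole aggregations in a single mapper. I am considering publishing this code on GitHub.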
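The code from UPDATE2 was not posted, so the following is only a sketch of the general idea of intermixing rows from several tables by key, not the author's input format. It assumes each table yields its rows already sorted by row key (as an HBase scan does) and uses plain strings as stand-ins for row keys. KeyOrderedMerge, Row and Cursor are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: k-way merge of per-table row streams, ordered by row key. */
final class KeyOrderedMerge {

    /** A row tagged with the table it came from. */
    static final class Row {
        final String table;
        final String key;

        Row(String table, String key) {
            this.table = table;
            this.key = key;
        }

        @Override
        public String toString() {
            return key + "@" + table;
        }
    }

    /** Cursor over one table's rows; head is the next row not yet emitted. */
    private static final class Cursor {
        final Iterator<Row> rows;
        Row head;

        Cursor(Iterator<Row> rows) {
            this.rows = rows;
            this.head = rows.next();
        }

        boolean advance() {
            if (!rows.hasNext()) {
                return false;
            }
            head = rows.next();
            return true;
        }
    }

    /** Merges the per-table lists (each sorted by key) into one key-ordered list. */
    static List<Row> merge(List<List<Row>> sortedTables) {
        // Order cursors by current key; break ties by table name for a stable result.
        PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> {
            int byKey = a.head.key.compareTo(b.head.key);
            return byKey != 0 ? byKey : a.head.table.compareTo(b.head.table);
        });
        for (List<Row> table : sortedTables) {
            if (!table.isEmpty()) {
                heap.add(new Cursor(table.iterator()));
            }
        }
        List<Row> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            merged.add(smallest.head);
            if (smallest.advance()) {
                heap.add(smallest); // re-queue with its next row
            }
        }
        return merged;
    }
}
```

Because rows with the same key from different tables come out next to each other, a consumer can aggregate per key in a single pass, which is the property the update describes.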

Answer 2

Passing a List<Scan> is another way. I also agree with the MultiTableInputFormat suggestion:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TestMultiScan extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        List<Scan> scans = new ArrayList<Scan>();

        // One Scan per input table; the table name travels as a scan attribute.
        Scan scan1 = new Scan();
        scan1.setAttribute("scan.attributes.table.name", Bytes.toBytes("table1ddmmyyyy"));
        System.out.println(Bytes.toString(scan1.getAttribute("scan.attributes.table.name")));
        scans.add(scan1);

        Scan scan2 = new Scan();
        scan2.setAttribute("scan.attributes.table.name", Bytes.toBytes("table2ddmmyyyy"));
        System.out.println(Bytes.toString(scan2.getAttribute("scan.attributes.table.name")));
        scans.add(scan2);

        // Configuration supplied by ToolRunner (see main).
        Configuration conf = getConf();
        Job job = new Job(conf);
        job.setJarByClass(TestMultiScan.class);

        // MultiTableMapper and MultiTableReducer are the poster's own classes (not shown).
        TableMapReduceUtil.initTableMapperJob(
                scans,
                MultiTableMapper.class,
                Text.class,
                IntWritable.class,
                job);
        TableMapReduceUtil.initTableReducerJob(
                "xxxxx",                 // output table name
                MultiTableReducer.class,
                job);

        // Report job success or failure through the exit code.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(HBaseConfiguration.create(), new TestMultiScan(), args));
    }
}

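This is how we solved our multi-tenancy requirement with HBase namespace tables. For example: DEV1:TABLEX (data ingested by DEV1) and UAT1:TABLEX (data consumed by UAT1). In the mapper we want to compare the two namespace tables before proceeding further.

Internally, this uses MultiTableInputFormat, as can be seen in TableMapReduceUtil.java.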

Answer 1
3 2013-05-19 18:21:32

Answer 2
1 2015-11-07 17:10:58