
HBase Scan with Multiple Ranges

I have an HBase table, and I need to get results from several ranges. For example, I may need to get data from different ranges such as rows 1-6, 100-150, and so on. I know that for each scan I can define the start row and stop row, but if I have 6 ranges, I need to run 6 scans. Is there any way to get the results from multiple ranges in just one scan, or one RPC? My HBase version is 0.98.

A filter to support scanning multiple row key ranges. It constructs the row key ranges from the passed list, which can be accessed by each region server.

HBase is quite efficient when scanning only one small row key range. If a user needs to specify multiple row key ranges in one scan, the typical solutions are:

  1. through a FilterList, which is a list of row key Filters,
  2. using a SQL layer over HBase to join two tables, such as Hive, Phoenix, etc. However, both solutions are inefficient.

    Neither of them can utilize the range info to perform fast-forwarding during the scan, which is quite time consuming. If the number of ranges is quite big (e.g. millions), a join is a proper solution, though it is slow.
    However, there are cases where the user wants to specify a small number of ranges to scan (e.g. fewer than 1000 ranges). Neither solution can provide satisfactory performance in such cases.
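The inefficiency of the FilterList approach can be seen in a plain-Java sketch (no HBase dependencies; the names here are illustrative): every row key the scanner reads must be tested against every range, and rows lying in the gaps between ranges are still read from disk and only then filtered out, rather than skipped.

```java
import java.util.List;

// Naive membership test modeling a FilterList of per-range row filters:
// each row key read by the scanner is checked against every range in
// turn, and keys falling between ranges are read anyway and then
// discarded -- there is no way to skip ahead to the next range.
class NaiveRangeCheck {
    // Each range is a String[2]: {startInclusive, stopExclusive}.
    static boolean included(String rowKey, List<String[]> ranges) {
        for (String[] r : ranges) {              // O(#ranges) work per row
            if (rowKey.compareTo(r[0]) >= 0 && rowKey.compareTo(r[1]) < 0) {
                return true;
            }
        }
        return false;                            // row was still read, then dropped
    }

    public static void main(String[] args) {
        List<String[]> ranges = List.of(
                new String[]{"001", "002"}, new String[]{"100", "150"});
        System.out.println(included("001", ranges)); // true
        System.out.println(included("050", ranges)); // false: read, then dropped
    }
}
```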

MultiRowRangeFilter supports such a use case (scanning multiple row key ranges): it constructs the row key ranges from a user-specified list and performs fast-forwarding during the scan. Thus, the scan will be quite efficient.

package chengchen;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.MultiRowRangeFilter;
import org.apache.hadoop.hbase.filter.MultiRowRangeFilter.RowKeyRange;
import org.apache.hadoop.hbase.util.Bytes;



public class MultiRowRangeFilterTest {
    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            throw new Exception("Table name not specified.");
        }
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, args[0]);

        // Build a scan whose filter accepts three disjoint row key ranges.
        Scan scan = new Scan();
        List<RowKeyRange> ranges = new ArrayList<RowKeyRange>();
        ranges.add(new RowKeyRange(Bytes.toBytes("001"), Bytes.toBytes("002")));
        ranges.add(new RowKeyRange(Bytes.toBytes("003"), Bytes.toBytes("004")));
        ranges.add(new RowKeyRange(Bytes.toBytes("005"), Bytes.toBytes("006")));
        Filter filter = new MultiRowRangeFilter(ranges);
        scan.setFilter(filter);
        int count = 0;
        ResultScanner scanner = table.getScanner(scan);
        Result r = scanner.next();
        while (r != null) {
            count++;
            r = scanner.next();
        }
        System.out
                .println("++ Scanning finished with count : " + count + " ++");
        scanner.close();
        table.close();
    }

}

Please see this test case for an implementation in Java.
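Independently of the HBase API, the fast-forwarding idea itself can be sketched in plain Java: keep the ranges sorted by start key, binary-search each row key against them, and when a key falls in a gap between ranges, return the start of the next range as a "seek hint" instead of testing subsequent keys one by one. This is a simplified model of what MultiRowRangeFilter does internally (the real filter hands the hint to the scanner via SEEK_NEXT_USING_HINT); the class and method names below are illustrative, not HBase API.

```java
import java.util.Arrays;
import java.util.Comparator;

// Simplified model of MultiRowRangeFilter's fast-forwarding. Ranges are
// sorted by start key; for each row key we binary-search the range list
// and either accept the key or produce the next range's start key as a
// seek hint, so the scanner can jump over the gap.
class RangeSeekSketch {
    static final class Range {
        final String start, stop; // half-open interval [start, stop)
        Range(String start, String stop) { this.start = start; this.stop = stop; }
    }

    private final Range[] ranges; // sorted, assumed non-overlapping

    RangeSeekSketch(Range... ranges) {
        this.ranges = ranges.clone();
        Arrays.sort(this.ranges, Comparator.comparing(r -> r.start));
    }

    /** Returns null if rowKey lies inside some range (include the row);
     *  otherwise the start key of the next range to seek to, or ""
     *  when the key is past all ranges (the scan can stop). */
    String seekHint(String rowKey) {
        int lo = 0, hi = ranges.length - 1, next = ranges.length;
        while (lo <= hi) {                       // find first range with stop > rowKey
            int mid = (lo + hi) >>> 1;
            if (rowKey.compareTo(ranges[mid].stop) >= 0) {
                lo = mid + 1;                    // rowKey is past this range
            } else {
                next = mid;                      // candidate range, look left
                hi = mid - 1;
            }
        }
        if (next == ranges.length) return "";    // past all ranges
        Range r = ranges[next];
        return rowKey.compareTo(r.start) >= 0
                ? null                           // inside the range: include row
                : r.start;                       // in a gap: seek forward to r.start
    }

    public static void main(String[] args) {
        RangeSeekSketch f = new RangeSeekSketch(
                new Range("001", "002"), new Range("003", "004"));
        System.out.println(f.seekHint("001")); // null  -> row included
        System.out.println(f.seekHint("002")); // "003" -> seek to next range
        System.out.println(f.seekHint("005")); // ""    -> past all ranges, stop
    }
}
```

The binary search is why a small number of ranges (say, under 1000) costs almost nothing per row, while the seek hint is what lets the scanner skip the gaps entirely instead of reading and discarding them.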

Note: However, for this kind of requirement, Solr or Elasticsearch is the best way in my opinion. You can check my answer with Solr for a high-level architecture overview. I'm suggesting this because an HBase scan over huge data will be very slow.

