HBase shell filter on hierarchical rowkey (or filter by rowkey length)

I have a hierarchical row-key design, where each segment is the ID of a field (we use 4-byte segments, but I will stick to two-digit segments for readability).

For example:

00
0000 = child of 00
000000 = child of 0000
0001 = child of 00
000100 = child of 0001

I would like to write an HBase shell query that returns the children of a node.

Right now I have the following:

scan 'tableName', STARTROW=>'00',
 FILTER=>"PrefixFilter('00') AND RowFilter(=,'regexstring:^00.{1}$')"

which gives the list of children of 00, namely 0000 and 0001.

There is more than one question here:
1. If I remove the $ sign, the performance improves dramatically (from 2 seconds to 0.2 seconds on a local VM), but I also get additional results (000000 and 000100, which I don't need). Is there a reason for this dramatic performance decrease with the $ in place (since it should only be an additional filter on an already narrowed-down list)?
2. Is there a way to filter by the length of the rowkey? (Then I could ditch the regex and use only startrow/endrow.) This has to be done in the hbase shell. For example, something like FILTER=>"RowKeyLengthFilter(4)".
3. I cannot use word (\\w) or digit (\\d) in the regex string; is this a limitation of the hbase shell? I also tried [[:alnum:]] and [[:digit:]] (thanks to Yunnosch for the suggestion).

version = 1.1.0.1, r4de7d45cb593f98ae5d020080cbc7116d3e9d9a0, Sun May 17 12:52:10 PDT 2015

In general:

  • Your regex string only matches keys that are 3 characters long (e.g. 000 or 001).
    -- E.g. 'regexstring:^00.{2}$' would match 4 characters/digits (e.g. 0000).
  • Is there a reason why you don't use curly brackets, like this?

    scan 'tbl', {ROWPREFIXFILTER => 'row2', FILTER => "QualifierFilter(>=, 'binary:abc')"}

  • Why don't you use ROWPREFIXFILTER (instead of STARTROW and PrefixFilter)?
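
To see how the quantifier and the trailing $ change the result, here is a minimal sketch in Python. It is not HBase code: it assumes the regex comparator searches the row key without implicit anchoring (which is consistent with the extra results reported when the $ is dropped), and it uses Python's re.search as a stand-in. The key list is the sample from the question; matching_keys is a hypothetical helper.

```python
import re

# Sample row keys from the question (two-digit segments).
KEYS = ["00", "0000", "000000", "0001", "000100"]

def matching_keys(pattern, keys=KEYS):
    """Return the keys in which the pattern is found (unanchored search)."""
    regex = re.compile(pattern)
    return [k for k in keys if regex.search(k)]

one_char = matching_keys(r"^00.{1}$")   # only keys of exactly 3 characters
two_chars = matching_keys(r"^00.{2}$")  # only keys of exactly 4 characters
no_anchor = matching_keys(r"^00.{2}")   # keys of 4 or more characters

print(one_char)
print(two_chars)
print(no_anchor)
```

With the sample keys, the first pattern matches nothing, the second matches only the direct children 0000 and 0001, and the third also lets the grandchildren 000000 and 000100 through.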

Regarding 3.:

You have to escape the backslashes in the regex string (as you would, e.g., in Java):

RowFilter(=,'regexstring:^\\d{4}$')
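
The effect of the doubled backslash can be illustrated outside HBase. The sketch below uses Python string literals, where \\ in a normal (non-raw) string stands for a single backslash, as it does in Java. That the shell's double-quoted filter string is unescaped the same way, and that a single backslash would be lost before reaching the regex engine, are assumptions made for illustration; is_match is a hypothetical helper.

```python
import re

# "\\d" in a normal string literal is two characters: a backslash and 'd'.
escaped = "^\\d{4}$"

# Assumption for illustration: if the backslash were dropped before the
# pattern reached the regex engine, the engine would see a literal 'd'.
dropped = "^d{4}$"

def is_match(pattern, key):
    """True if the pattern is found in the key."""
    return re.search(pattern, key) is not None

print(is_match(escaped, "0001"))  # the digit class reaches the engine
print(is_match(dropped, "0001"))  # a literal 'd' does not match digits
print(is_match(dropped, "dddd"))  # but it does match the letter 'd'
```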

Regarding 1.:

I can only guess here: without the trailing $, HBase may be able to return a contiguous range of keys (which is quick to locate, since row keys are stored in sorted order), whereas if you require an exact length, HBase has to check every entry in the relevant range again (with all the resources reserved for that task).
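
The intuition in the paragraph above (a range is cheap to locate, an exact length needs a per-row check) can be sketched with a toy model. This is not HBase's actual scan path: a sorted Python list stands in for the table, bisect stands in for the start/stop row lookup, and the length check stands in for the kind of row-key-length filter asked about in question 2. All names are hypothetical.

```python
from bisect import bisect_left

# Row keys kept in sorted order, as a stand-in for a table.
KEYS = sorted(["00", "0000", "000000", "0001", "000100", "01", "0100"])

def prefix_range(keys, prefix):
    """Locate the contiguous slice of keys starting with a non-empty prefix."""
    start = bisect_left(keys, prefix)
    # Smallest string that sorts after every key with this prefix.
    stop_key = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    stop = bisect_left(keys, stop_key)
    return keys[start:stop]

def children(keys, parent, segment_len=2):
    """Keep only keys exactly one segment longer than the parent."""
    wanted = len(parent) + segment_len
    # Every key in the range is inspected; this is the per-row cost.
    return [k for k in prefix_range(keys, parent) if len(k) == wanted]

print(prefix_range(KEYS, "00"))
print(children(KEYS, "00"))
```

The range lookup needs only two binary searches, while the length check has to touch each key in the range, which mirrors the guess above without confirming that it explains the measured timings.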
