HBase row key and Range Scan

I have a row key pattern like 20110103--- so that the row keys look like 20110103-1-23-333.

When I run a range query with a scan, for example startRow -> 20110103-1-23- and endRow -> 20110105-1-23-,

I also get rows that I do not expect to be in that range. For example, I get the row 20110105-1-15-6666, so rows related to store 15 are returned as well.

How can I fix this? Will a regular expression filter resolve it?

Please advise on this issue.

Of the three row keys you listed:

20110103-1-23-
20110105-1-15-666
20110105-1-23-

That looks like the natural sort order to me; the one ending in "666" does indeed come after the one starting with "20110103".

(One point of confusion may be that to HBase, these are all just bytes, and the lexicographical sort is done one byte at a time; so, "aaa" will sort after "aa" but before "ab".)
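To make the byte-wise ordering concrete, here is a minimal sketch that uses only the JDK, not the HBase client. The class name RowKeyOrder and its helper methods are made up for illustration; the comparison is meant to mimic how a byte-ordered store sorts keys, under the assumption that keys are plain UTF-8 strings.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RowKeyOrder {
    // Compare two keys as unsigned bytes, one byte at a time (hypothetical helper).
    public static int compare(String a, String b) {
        return Arrays.compareUnsigned(
            a.getBytes(StandardCharsets.UTF_8),
            b.getBytes(StandardCharsets.UTF_8));
    }

    // True if row falls in the half-open range [start, stop).
    public static boolean inRange(String start, String row, String stop) {
        return compare(start, row) <= 0 && compare(row, stop) < 0;
    }

    public static void main(String[] args) {
        System.out.println(compare("aa", "aaa") < 0);   // "aa" sorts before "aaa"
        System.out.println(compare("aaa", "ab") < 0);   // "aaa" sorts before "ab"
        // The store-15 row from the question lies inside the scanned range.
        System.out.println(inRange("20110103-1-23-", "20110105-1-15-6666", "20110105-1-23-"));
    }
}
```

All three lines print true: the store-15 key differs from the stop row first at the byte "1" versus "2", so it sorts before the stop row and is returned by the scan.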

You can open the hbase shell and issue the following command:

scan 'YourHbaseTableName',{FILTER=>"(RowFilter(=,'regexstring:20110103'))"}
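Note that the pattern in the shell command above only selects keys containing 20110103. If the goal is to keep only store 23 across the whole date range, a pattern anchored on the store segment may be closer to what is wanted. The sketch below checks such a pattern with plain java.util.regex; the class name and the pattern are assumptions for illustration, and it assumes the second key segment is always 1, as in the question. As far as I know, the regexstring comparator applies the pattern with "find" semantics, which is why the sketch uses find().

```java
import java.util.regex.Pattern;

public class StoreKeyPattern {
    // Hypothetical pattern: an 8-digit date, then "-1-", then store 23, anchored at the start of the key.
    public static final Pattern STORE_23 = Pattern.compile("^\\d{8}-1-23-");

    // True if the row key belongs to store 23 under the assumed key layout.
    public static boolean matches(String rowKey) {
        return STORE_23.matcher(rowKey).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("20110103-1-23-333"));   // true
        System.out.println(matches("20110105-1-15-6666"));  // false
    }
}
```

A row filter like this still reads every row between the start and stop rows and discards the non-matching ones on the server, so it narrows the results but not the amount of data scanned.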

Row 20110105-1-15-6666 is correctly in the range [20110103-1-23-, 20110105-1-23-): rows are sorted lexicographically, and "15" sorts before "23", so this key comes before the stop row.

You mentioned "I am getting rows related to store 15 as well", which makes me think that the third number in the row key (________-_-23-) is some kind of attribute of the row.

I suggest changing the schema of this table to make this "store number" a column, so that your keys can look like 20110103-1 and in the column "store" you have those numbers 15 or 23 or whatever.

This way, in a Scan, you can filter away the rows that have column store=15.

If you are using the Java API, this will look something like:

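import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Keep only rows whose "storenumber" column is not 15.
SingleColumnValueFilter filter = new SingleColumnValueFilter(
   Bytes.toBytes("columnfamily"),
   Bytes.toBytes("storenumber"),
   CompareFilter.CompareOp.NOT_EQUAL,
   Bytes.toBytes(15)  // 4-byte int; must match how the value was written
);
// Also drop rows that have no "storenumber" column at all.
filter.setFilterIfMissing(true);
// The start row is inclusive, the stop row is exclusive.
Scan scan = new Scan(
   Bytes.toBytes("20110103-1"),
   Bytes.toBytes("20110105-1")
);
scan.setFilter(filter);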

You might be storing too much data in the row key. Try taking some of those attributes out of the row key and making them columns. Also keep in mind that dates (I suppose 20110105 is a date) can be stored as the timestamps of the table's cells instead of in the row key. It depends on your application.

Think of HBase as a multi-level, nested, ordered map of bytes. You therefore need to save your timestamps in a binary representation to get the right order in each query.

I think you are saving your row key values as strings, using for instance the Java method:

yourDateString.getBytes(encoding) 

or

Bytes.toBytes(yourDateString)

provided by the HBase API.

My advice is to save time values as a timestamp (long). This long should be serialized to bytes and then stored in the row key. Note that putting timestamps at the start of the row key is somewhat problematic because they increase constantly. The timestamp gets bigger with every millisecond, so every new value is written to the one region that holds the end of the key range. Simply put, you only ever write to one of your cluster machines, which defeats the purpose of using an HBase cluster. For clusters of up to about 100 machines you can use salting (put a random number in front of the row key to distribute the values across your cluster). Check out the Apache Phoenix project. It does the serializing, salting, etc. transparently for you and provides simple SQL-like statements.
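A rough sketch of both ideas, using only the JDK rather than the HBase client: the timestamp is written as an 8-byte big-endian long, and a one-byte salt is put in front of the key. The class name TimestampKey, the bucket count of 16, and the choice to derive the salt from a hash of the key (instead of a truly random number, so that a reader can recompute it) are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class TimestampKey {
    // Hypothetical bucket count; in practice it would be tuned to the cluster.
    public static final int BUCKETS = 16;

    // Big-endian 8-byte encoding: for non-negative values,
    // unsigned byte order is the same as numeric order.
    public static byte[] encode(long timestampMillis) {
        return ByteBuffer.allocate(Long.BYTES).putLong(timestampMillis).array();
    }

    // Prefix the key with one salt byte derived from the key itself,
    // so the bucket can be recomputed when reading a known key.
    public static byte[] salted(byte[] key) {
        byte salt = (byte) Math.floorMod(Arrays.hashCode(key), BUCKETS);
        return ByteBuffer.allocate(1 + key.length).put(salt).put(key).array();
    }

    public static void main(String[] args) {
        byte[] earlier = encode(1294012800000L); // 2011-01-03 00:00 UTC
        byte[] later = encode(1294185600000L);   // 2011-01-05 00:00 UTC
        System.out.println(Arrays.compareUnsigned(earlier, later) < 0); // true
        System.out.println(salted(earlier).length);                    // 9
    }
}
```

The trade-off is that salted keys are no longer contiguous by time, so a time-range query has to run one scan per bucket and merge the results.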
