简体   繁体   中英

How filter Scan of HBase by part of row key?

I have HBase table with row keys, which consist of text ID and timestamp, like next:

...
string_id1.1470913344067
string_id1.1470913345067
string_id2.1470913344067
string_id2.1470913345067
...

How can I filter Scan of HBase (in Scala or Java) to get results with some string ID and timestamp more than some value?

Thanks

Fuzzy row approach is efficient for this kind of requirement and when data is is huge : As explained by this article FuzzyRowFilter takes as parameters row key and a mask info.

In example above, in case we want to find last logged in users and row key format is userId_actionId_timestamp (where userId has fixed length of say 4 chars), the fuzzy row key we are looking for is ????_login_ . This translates into the following params for FuzzyRowKey:

FuzzyRowFilter rowFilter = new FuzzyRowFilter(
 Arrays.asList(
  new Pair<byte[], byte[]>(
    Bytes.toBytesBinary("\x00\x00\x00\x00_login_"),
    new byte[] {1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0})));

Would suggest to go through hbase-the-definitive guide --> Client API: Advanced Features

Lets say you somehow ended up having your lines in a monadic traversable structure like List or RDD. Now, you want to have only the strings with id = "string_id2" and timestamp > 1470913345000 .

Now what is the problem here ? Just filter you traversable monadic structure on these two criteria.

val filtered = listOrRddOfLines
  .map(l => {
    val idStr :: timestampStr :: Nil = l.split('.').toList
    (idStr, timestampStr.toLong)
  })
  .filter({
    case (idStr, timestamp) => idStr.equals("string_id2") && (timestamp > "1470913345000".toLong)
  })

I resolve my problem by using to filters:
- PrefixFilter (I put to this filter first part of row key. In my case - string ID, for example "string_id1.")
- RowFilter (I put there two parametres: first - CompareOp.GREATER_OR_EQUAL , second - all my row key with necessary timestamp, for example "string_id1.1470913345000"

In result I get all cells with row key, which has necessary string_id if first part, and with timestamp more or equal than I put in filter in second part. It is exactly what I want.

Code snippet:

val s = new Scan()
s.addFamily(family.getBytes)
val filterList = new FilterList()
filterList.addFilter(new PrefixFilter(Bytes.toBytes(prefixOfRowKey)))
filterList.addFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new BinaryComparator(valueForBinaryFilter.getBytes())))
s.setFilter(filterList)
val scanner = table.getScanner(s)

Thanks to everyone who helped to find a solution.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM