
HBase Scan - RowKey Filters

Let me try to explain my issue briefly.

Imagine we have an HBase table that holds the information of every visit to a disco: each disco registers its name, the name of the visitor, and the day he visited it. (Yes, it's a dumb example, I know.)

So, for example, these would be some values of the table:

..
ministryOfSoundJamesOliver01022017
ministryOfSoundJamesOliver02022017
ministryOfSoundJamesOliver03022017
ministryOfSoundOliviaNewton04042017
ministryOfSoundOliviaNewton06042017
...
pachaibizaJohnMcKiness06042017
pachaibizaJohnMcKiness04042017
pachaibizaWilliamForrester04042017
..

The RowKey has the following structure (the three parts are concatenated in this order):

discoName
personName
dayOfTheYear

(The table has some other columns/qualifiers, but they don't matter for this issue.)


The issue is: imagine a boy who simply loves going to Ministry Of Sound. He just loves it; he spends all his money on discos and drugs (but that's not the point here).

My goal is to output every person who attended Ministry Of Sound. In my scan, this dude keeps appearing in the results, so I must discard a lot of entries before I reach the next visitor. For example:

..
ministryOfSoundJohnnyYonkie01022017
ministryOfSoundJohnnyYonkie02022017
ministryOfSoundJohnnyYonkie03022017
ministryOfSoundJohnnyYonkie04022017
ministryOfSoundJohnnyYonkie05022017
ministryOfSoundAnotherDude02022017
...

In order to register AnotherDude, I must first discard 4 repeated entries from Johnny.

Finally, the question is:


Is there any way to tell HBase that the repetitive entries from byte(x) to byte(x+y) [x being the number of bytes in discoName and y the number of bytes in personName] must be automatically discarded?


Thanks a lot in advance!!

First things first: If you only have client access, I can't help you :(

If you have additional access, then you could look at the following suggestions, but the default reply would be: if this is your access pattern, optimize your schema for it.

If you need to access data in a certain way, make sure you write it in that way, in the first place. Use the map-reduce API if you have to perform migrations.

I would probably simply add a table that merely writes a row ministryOfSound and a column per visitor. (In general, if post-processing the duplicate results away is really a performance issue, the schema you propose doesn't sound very well suited for HBase, since you have a bunch of writes with monotonically increasing rowkeys.)
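To make the "one row per disco, one column per visitor" idea concrete, here is a minimal in-memory sketch. It does not use the HBase client at all: a `TreeMap` stands in for the extra table, and the class and method names (`VisitorIndex`, `recordVisit`, `visitorsOf`) are hypothetical. The point it illustrates is that repeated visits by the same person collapse into a single column, so reading the visitors of one disco becomes a single-row lookup.

```java
import java.util.Collections;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** In-memory stand-in for a hypothetical "visitors by disco" lookup table. */
final class VisitorIndex {
    // Row key = discoName; each element of the set plays the role of one column qualifier.
    private final TreeMap<String, NavigableSet<String>> rows = new TreeMap<>();

    /** Called on every visit, alongside the write to the main visits table. */
    void recordVisit(String discoName, String personName) {
        // Adding an existing visitor is a no-op, like overwriting the same column.
        rows.computeIfAbsent(discoName, k -> new TreeSet<>()).add(personName);
    }

    /** Equivalent of a single-row read: all distinct visitors of one disco. */
    NavigableSet<String> visitorsOf(String discoName) {
        NavigableSet<String> visitors = rows.get(discoName);
        return visitors == null
                ? Collections.<String>emptyNavigableSet()
                : Collections.unmodifiableNavigableSet(visitors);
    }
}
```

In a real deployment the write to this table would happen at the same time as the write to the main table, and a very popular disco would produce a very wide row, which is a trade-off to weigh.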

On the other hand, if this is an ad-hoc query, then you probably want to use the MapReduce API straight away, or maybe the Apache Spark integration, and perform a "distinct" call on the data.
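As a rough illustration of what that "distinct" step computes, the sketch below applies it to a plain list of rowkeys with Java streams rather than MapReduce or Spark. It assumes, based on the sample rows in the question, that every rowkey ends in a fixed 8-character date; `DistinctVisitors` and `DATE_LENGTH` are hypothetical names.

```java
import java.util.List;
import java.util.stream.Collectors;

final class DistinctVisitors {
    // Assumption: every rowkey ends with an 8-character date such as 01022017.
    static final int DATE_LENGTH = 8;

    /** Returns each visitor of the given disco once, in order of first appearance. */
    static List<String> distinctVisitors(List<String> rowKeys, String discoName) {
        return rowKeys.stream()
                // Keep only keys of this disco that leave room for a name and a date.
                .filter(k -> k.startsWith(discoName)
                        && k.length() > discoName.length() + DATE_LENGTH)
                // Strip the disco prefix and the date suffix, leaving personName.
                .map(k -> k.substring(discoName.length(), k.length() - DATE_LENGTH))
                .distinct()
                .collect(Collectors.toList());
    }
}
```

Because the rowkey has no delimiter, this cannot tell a disco called ministryOfSound apart from a longer disco name that starts with the same characters; a delimiter or fixed-length fields would remove that ambiguity.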

Using Scans for analytical queries isn't how I would do it.

If you had to do it using Scans, then I would recommend you implement a coprocessor. Coprocessors can augment a Filter with state, and you can project the results of a PrefixFilter'd Scan on the RegionServer side. If you're new to coprocessors, HBase: The Definitive Guide has an introduction. This requires that you can deploy jars onto the RegionServer classpath.
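The server-side skipping described above can be modelled as a "seek past the current person" loop. The sketch below is not HBase code: a `NavigableSet` of strings stands in for the sorted rowkeys, and `SkipScan`, `DATE_LENGTH`, and `MAX_DATE_SUFFIX` are hypothetical names. It assumes the 8-digit date suffix seen in the sample rows and that names contain no digits. In HBase itself, a custom Filter can ask for this kind of jump by returning `Filter.ReturnCode.SEEK_NEXT_USING_HINT` and supplying the target through `getNextCellHint`, which again means deploying a jar to the RegionServers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

final class SkipScan {
    // Assumptions: rowkeys end in an 8-digit date, and names contain no digits.
    static final int DATE_LENGTH = 8;
    static final String MAX_DATE_SUFFIX = "99999999";

    /** Lists each visitor of a disco once, jumping over that visitor's other rows. */
    static List<String> distinctVisitors(NavigableSet<String> sortedRowKeys, String discoName) {
        List<String> visitors = new ArrayList<>();
        // Start at the first key that could belong to this disco.
        String cursor = sortedRowKeys.ceiling(discoName);
        while (cursor != null && cursor.startsWith(discoName)) {
            if (cursor.length() <= discoName.length() + DATE_LENGTH) {
                // Too short to hold a name plus a date: step over it.
                cursor = sortedRowKeys.higher(cursor);
                continue;
            }
            // discoName + personName, i.e. the key without its date suffix.
            String personPrefix = cursor.substring(0, cursor.length() - DATE_LENGTH);
            visitors.add(personPrefix.substring(discoName.length()));
            // Seek hint: first key after the largest possible date for this person.
            cursor = sortedRowKeys.higher(personPrefix + MAX_DATE_SUFFIX);
        }
        return visitors;
    }
}
```

With the rows from the question, the loop touches one row for Johnny and then jumps straight to the next visitor, instead of reading and discarding the other four.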

But again, if you blow up your client by doing a distinct filtering there, you're probably also blowing up your regions due to hotspots on the inserts.

As a final alternative: you might want to look at Apache Phoenix and see if you can coerce your rowkey into a schema from which you can do a select distinct on the first two parts of the rowkey. This would obviously require that you have a delimiter in your rowkey, or at least fixed-length fields.

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint, please indicate the site URL or the original address.

 