
HBase: how to design this rowkey?

I have a large amount of data to store in HBase. It's basically a CSV file containing product information:

date|product_id|client_id|client_name
2020-08-02|152341|1|Tom
2020-08-02|152341|2|Kate

The user should be able to retrieve a list of product info by (date, product_id), which should be the API parameters. (date, product_id) is not unique.

In this case, how do I design the rowkey in HBase?

As (date, product_id) is not unique, I must add a UUID to it when inserting data into HBase, so the rowkey will look like this: 2020-08-02_152341_[UUID]. That works, but it creates a hotspot problem.

But if I add a salt/hash prefix, like 01-2020-08-02_152341_[UUID], how can I know what the UUID is? It's not part of the user input, so I can neither use startKey/endKey (because of the salt) nor reconstruct the rowkey.

You need both a hash prefix and a unique suffix. Here is how:

  • To avoid hotspots, prepend to the row key a hash of date and product_id (not of the UUID). A simple hash function such as murmur should do. Because the hash depends only on the API parameters, it can be recomputed at read time.
  • Since the combination of date and product_id is not unique, you also need to append a unique value to the row key. This can be a UUID, but if your domain model already has an attribute that is unique within a (date, product_id) pair, append that instead. (I see "1|Tom" as a record. Is that "1" unique?)
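The two points above can be sketched as follows. This is only an illustration under some assumptions: MD5 from the Python standard library stands in for murmur, client_id is assumed to be unique within a (date, product_id) pair, and the helper names key_hash and build_rowkey are made up for this example.

```python
import hashlib


def key_hash(date: str, product_id: str, length: int = 4) -> str:
    """Short, deterministic hash of (date, product_id).

    Assumption: MD5 is used only because it ships with Python; any stable
    hash (e.g. murmur) works if writers and readers use the same one.
    """
    digest = hashlib.md5(f"{date}_{product_id}".encode("utf-8")).hexdigest()
    return digest[:length]


def build_rowkey(date: str, product_id: str, unique_suffix: str) -> str:
    """Row key layout: <hash>_<date>_<product_id>_<unique_suffix>."""
    return f"{key_hash(date, product_id)}_{date}_{product_id}_{unique_suffix}"


# Sample records from the question: (date, product_id, client_id, client_name).
rows = [
    ("2020-08-02", "152341", "1", "Tom"),
    ("2020-08-02", "152341", "2", "Kate"),
]

# Assumption: client_id serves as the unique suffix; a UUID would work too.
rowkeys = [build_rowkey(d, p, c) for d, p, c, _name in rows]
for key in rowkeys:
    print(key)
```

Both keys share the same hash prefix because the hash ignores the suffix, so the rows for one (date, product_id) pair stay adjacent while different pairs spread across the key space.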

When reading records from HBase, scan rows by prefix. In this case, your prefix would be:

hash(date + "_" + product_id) + "_"+ date + "_" + product_id + "_"
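To illustrate why this prefix is enough to find the rows without knowing the UUID, here is a rough simulation. It makes several assumptions: a sorted in-memory list of byte strings stands in for the HBase table, MD5 stands in for murmur, and the function names are hypothetical. A prefix scan amounts to a range scan from the prefix up to the prefix with its last byte incremented.

```python
import hashlib
from bisect import bisect_left


def scan_prefix(date: str, product_id: str, hash_len: int = 4) -> bytes:
    """Prefix rebuilt from the API parameters alone (no UUID needed)."""
    base = f"{date}_{product_id}"
    # Assumption: MD5 stands in for murmur; use the same hash as the writer.
    salt = hashlib.md5(base.encode("utf-8")).hexdigest()[:hash_len]
    return f"{salt}_{base}_".encode("utf-8")


def stop_row_for_prefix(prefix: bytes) -> bytes:
    """Smallest key that sorts after every key starting with prefix."""
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""  # empty stop row: scan to the end of the table
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def prefix_scan(sorted_keys: list, prefix: bytes) -> list:
    """Return keys in [prefix, stop_row), like a start/stop row scan."""
    start = bisect_left(sorted_keys, prefix)
    stop_row = stop_row_for_prefix(prefix)
    stop = bisect_left(sorted_keys, stop_row) if stop_row else len(sorted_keys)
    return sorted_keys[start:stop]


# Simulated table: keys kept in byte order, as HBase stores them.
records = [
    ("2020-08-02", "152341", "uuid-a"),
    ("2020-08-02", "152341", "uuid-b"),
    ("2020-08-02", "999999", "uuid-c"),
    ("2020-08-03", "152341", "uuid-d"),
]
table = sorted(scan_prefix(d, p) + u.encode("utf-8") for d, p, u in records)

matches = prefix_scan(table, scan_prefix("2020-08-02", "152341"))
for key in matches:
    print(key.decode("utf-8"))
```

Because the prefix ends with a trailing "_", a product_id such as 1523410 does not match a scan for 152341.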

See Scan.setRowPrefixFilter for how to fetch by prefix. Alternatively, you may consider using a library such as hbase-orm to fetch records by prefix in an object-oriented way (Disclosure: I'm the author of the library).
