
HBase row key design for reads and updates

I'm trying to understand the best way to design the row key for my HBase table.

My use case:

Current structure:

PersonID | BatchDate | PersonJSON

When something about a person is modified, a new PersonJSON and a new BatchDate are written to HBase, overwriting the old record. Every 4 hours, a scan collects all the people who were modified, and they are pushed to Hadoop for further processing.

If my key is just PersonID, it is great for updating the data. But scan performance is poor, because I have to add a filter on the BatchDate column to find all the rows newer than a given batch date.

If my key is a composite like BatchDate|PersonID, I could use a start row and end row on the row key and get all the rows that have been modified. But then I would have a lot of duplicates, since a person no longer maps to a single row key, and I could no longer update a person in place.
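To make the trade-off concrete, here is a minimal sketch of the composite-key idea. It does not use the HBase client; `SortedTable` is a hypothetical in-memory stand-in that keeps rows sorted by key, and the timestamp format and sample people are assumptions for illustration. It shows both the cheap range scan and the duplicate rows for a person who was modified twice.

```python
import bisect

def composite_key(batch_date: str, person_id: str) -> str:
    # Sortable, fixed-width date first so rows cluster by batch.
    # Assumes "|" never appears inside a PersonID.
    return f"{batch_date}|{person_id}"

class SortedTable:
    """Hypothetical in-memory stand-in for a table sorted by row key."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan(self, start_row, stop_row):
        # Start row inclusive, stop row exclusive.
        lo = bisect.bisect_left(self._keys, start_row)
        hi = bisect.bisect_left(self._keys, stop_row)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

table = SortedTable()
table.put(composite_key("20240101T0000", "p1"), '{"name": "Ann"}')
table.put(composite_key("20240101T0400", "p2"), '{"name": "Bob"}')
# p1 is modified again: this creates a second row instead of an update.
table.put(composite_key("20240101T0800", "p1"), '{"name": "Ann B."}')

# Everything modified at or after 04:00 is one contiguous key range.
changed = table.scan("20240101T0400", "20240101T1200")
all_rows = table.scan("", "~")
```

The range scan touches only the rows in the window, but `all_rows` now holds two rows for `p1`, which is the duplication problem described above.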

Is a Bloom filter on row+col (PersonID + BatchDate) an option?

Any help is appreciated. Thanks, Abhishek

In addition to the table with PersonID as the row key, it sounds like you need a dual-write secondary index, with BatchDate as the row key.

Another option would be Apache Phoenix, which provides support for secondary indexes.
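A rough sketch of what the dual write could look like, using plain dictionaries in place of the two HBase tables. The function name `upsert_person`, the `BatchDate|PersonID` index key layout, and the choice to delete the stale index row on update are assumptions for illustration, not something the answer specifies.

```python
# Hypothetical in-memory stand-ins for the two tables.
main_table = {}   # PersonID -> {"BatchDate": ..., "PersonJSON": ...}
index_table = {}  # "BatchDate|PersonID" -> empty value

def upsert_person(person_id, batch_date, person_json):
    """Write the person row and keep the BatchDate index in step with it."""
    old = main_table.get(person_id)
    if old is not None:
        # Remove the index row that points at the previous batch date,
        # so each person appears at most once in the index.
        index_table.pop(f"{old['BatchDate']}|{person_id}", None)
    main_table[person_id] = {"BatchDate": batch_date, "PersonJSON": person_json}
    index_table[f"{batch_date}|{person_id}"] = b""

upsert_person("p1", "20240101T0000", '{"name": "Ann"}')
upsert_person("p1", "20240101T0800", '{"name": "Ann B."}')
```

In plain HBase the two writes go to different tables and are not atomic together, so a real implementation would probably need to tolerate an occasional stale or missing index row, for example by checking the BatchDate stored in the main table when reading.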

I usually do this in two steps. Create table one, whose key is the combination BatchDate+PersonID; the value can be empty. Create table two as you normally would: the key is PersonID and the value is the whole record.

For a date-range query, first query table one to get the PersonIDs, then use the HBase batch get API to fetch the data in batches. It should be very fast.
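The read path described above can be sketched as follows. This is an in-memory model rather than HBase client code: `index_keys` and `main_table` are hypothetical sample data, the `|` separator is an assumption, and `batch_get` stands in for a multi-get such as `Table.get(List<Get>)` in the Java client.

```python
import bisect

# Table one: sorted "BatchDate|PersonID" keys with empty values (sample data).
index_keys = sorted([
    "20240101T0000|p3",
    "20240101T0400|p2",
    "20240101T0800|p1",
])

# Table two: PersonID -> full record (sample data).
main_table = {
    "p1": '{"name": "Ann B."}',
    "p2": '{"name": "Bob"}',
    "p3": '{"name": "Cy"}',
}

def modified_person_ids(start_date, stop_date):
    """Range-scan the index: start inclusive, stop exclusive."""
    lo = bisect.bisect_left(index_keys, start_date)
    hi = bisect.bisect_left(index_keys, stop_date)
    # dict.fromkeys de-duplicates while preserving scan order.
    ids = dict.fromkeys(key.split("|", 1)[1] for key in index_keys[lo:hi])
    return list(ids)

def batch_get(person_ids):
    """Fetch many rows by key in one call; missing keys are skipped."""
    return {pid: main_table[pid] for pid in person_ids if pid in main_table}

ids = modified_person_ids("20240101T0400", "20240101T1200")
people = batch_get(ids)
```

The index scan only reads small keys in the requested window, and the second step is a set of point lookups by PersonID, so neither step needs a filter over the whole table.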
