
HBase: sorting on column qualifiers

I have an HBase table with a couple of million records. Each record has a couple of properties describing it, each stored in its own column qualifier (mostly int or string values). I have a requirement to display the records paginated and sorted by a column qualifier (or, in the future, by more than one). What would be the best approach? I have looked into secondary indexes using coprocessors (mostly hindex from Huawei), but they don't seem to match my use case exactly. I've also thought about replicating all the data into multiple tables, one per sort property, with that property included in the rowkey, and then redirecting queries to those tables. But this seems very tedious, as I already have quite a few such properties.

Thanks for any suggestions.

You need your NoSQL database to work just like an RDBMS, and given the size of your data, your life would be a lot simpler if you stuck to an RDBMS, unless you expect exponential growth :) Also, you don't mention whether your data gets updated, which is very important for making a good decision.

Having said that, you have a lot of options. Here are some:

  • If you can wait for the results: Write a MapReduce job to do the scan, sort the output, and retrieve the top X rows. Do you really need more than 1000 pages (20-50k rows) for each sort type? Another option would be to use something like Hive.

  • If you can aggregate the data and "reduce" the dataset: Write a MapReduce job to periodically export the newest aggregated data to a SQL table (which will handle the queries). I've done this a few times and it works like a charm, but it depends on your requirements.

  • If you have plenty of storage: Write a MapReduce job to periodically regenerate (or append data to) a table sorted by each property, with the property value in the rowkey. You don't need multiple tables; just use a rowkey prefix for each property. Or, if you do not want extra tables and you won't have a lot of queries, simply write the sorted data to CSV files and store them in HDFS, where your frontend app can easily read them.

  • Manually maintain a secondary index: This would not be very tolerant of schema updates and new properties, but it would work great for near real-time results. To do it, you have to update your code to also write to the secondary table, with a good buffer to help with performance, while avoiding hot regions. Think about rowkeys of this type: [4B SORT FIELD ID (4 chars)] [8B SORT FIELD VALUE] [8B timestamp], with just one column storing the rowkey of the main table. To retrieve the data sorted by any of the fields, perform a SCAN using the SORT FIELD ID as the start row plus the starting sort field value as the pivot for pagination (omit the value to get the first page, then set it to the last one retrieved). That gives you the rowkeys of the main table, and you can then perform a multiget against it to retrieve the full data. Keep in mind that you'll need a small script to scan the main table and write the existing rows to the index table.

  • Rely on one of the automatic secondary indexing solutions built on coprocessors, like the one you mentioned, although I do not like this option at all.
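To make the manually maintained index option more concrete, here is a minimal Python sketch of the rowkey layout and the pagination scan. It uses no HBase client: `IndexTable` is a hypothetical in-memory stand-in for the index table (rows kept sorted by key bytes, as HBase does), and `index_key` and `fetch_page` are illustrative names. Two details are assumptions that go beyond the answer: integer values are stored big-endian with the sign bit flipped so that byte order matches numeric order, and the page pivot is the full last index rowkey rather than only the last value, so that rows with equal values are not skipped. String values would need their own fixed-width encoding, which is not shown.

```python
import bisect
import struct

SIGN_FLIP = 1 << 63  # assumption: offset so signed ints sort correctly as bytes


def index_key(field_id: str, value: int, timestamp_ms: int) -> bytes:
    """Build an index rowkey: [4B field id][8B value][8B timestamp]."""
    fid = field_id.encode("ascii")
    if len(fid) != 4:
        raise ValueError("field id must be exactly 4 ASCII characters")
    # Big-endian bytes compare in the same order as the numbers they encode.
    return fid + struct.pack(">Q", value + SIGN_FLIP) + struct.pack(">Q", timestamp_ms)


class IndexTable:
    """Hypothetical in-memory stand-in for an HBase table sorted by rowkey."""

    def __init__(self):
        self._keys = []   # rowkeys, kept in sorted byte order
        self._rows = {}   # rowkey -> rowkey of the main table

    def put(self, key: bytes, main_rowkey: str) -> None:
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = main_rowkey

    def scan(self, start_key: bytes, stop_key: bytes, limit: int):
        """Return up to `limit` rows with start_key <= key < stop_key."""
        i = bisect.bisect_left(self._keys, start_key)
        out = []
        while i < len(self._keys) and self._keys[i] < stop_key and len(out) < limit:
            out.append((self._keys[i], self._rows[self._keys[i]]))
            i += 1
        return out


def fetch_page(table: IndexTable, field_id: str, page_size: int, last_key=None):
    """Fetch one page sorted by `field_id`, resuming after `last_key` if given."""
    prefix = field_id.encode("ascii")
    # Appending a zero byte yields the smallest key strictly after last_key.
    start = prefix if last_key is None else last_key + b"\x00"
    # Stop where this field's prefix range ends.
    stop = prefix[:-1] + bytes([prefix[-1] + 1])
    return table.scan(start, stop, page_size)


table = IndexTable()
ages = {"row-a": 42, "row-b": -7, "row-c": 19, "row-d": 42}
for ts, (main_rowkey, age) in enumerate(ages.items()):
    table.put(index_key("age_", age, ts), main_rowkey)
table.put(index_key("size", 1, 0), "row-a")  # a second indexed field

page1 = fetch_page(table, "age_", 2)
page2 = fetch_page(table, "age_", 2, last_key=page1[-1][0])
print([main for _, main in page1])  # main-table rowkeys to multiget
print([main for _, main in page2])
```

Against a real cluster, `scan` would correspond to a Scan with a start row, a stop row and a row limit, and the returned main-table rowkeys would be fetched with a batched get. Note that every key of one field shares the same 4-byte prefix, so writes for a heavily updated field land in adjacent regions; the buffering and hot-region concerns from the answer still apply.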

You have mostly enumerated the options. As you are aware, HBase does not natively support secondary indexes. In addition to hindex, you may consider Phoenix

https://github.com/forcedotcom/phoenix

(from Salesforce), which, in addition to secondary indexes, has a JDBC driver and SQL support.

