
HBase sort on column qualifiers

I have an HBase table with a couple of million records. Each record has a couple of properties describing it, each stored in its own column qualifier (mostly int or string values). I have a requirement that I should be able to view the records paginated and sorted by a column qualifier (or even more than one, in the future). What would be the best approach to do this? I have looked into secondary indexes using coprocessors (mostly hindex from Huawei), but it doesn't seem to match my use case exactly. I've also thought about replicating all the data into multiple tables, one for each sort property, which would be included in the rowkey, and then redirecting queries to those tables. But this seems very tedious, as I already have quite a few of these properties.

Thanks for any suggestions.

You need your NoSQL database to work just like an RDBMS, and given the size of your data, your life would be a lot simpler if you stuck to one, unless you expect exponential growth :) Also, you don't mention whether your data gets updated, which is very important for making a good decision.

Having said that, you have a lot of options; here are some:

  • If you can wait for the results: write a MapReduce job to do the scan, sort it and retrieve the top X rows. Do you really need more than 1000 pages (20-50k rows) for each sort type? Another option would be to use something like Hive.

  • If you can aggregate the data and "reduce" the dataset: write a MapReduce job to periodically export the newest aggregated data to a SQL table (which will handle the queries). I've done this a few times and it works like a charm, but it depends on your requirements.

  • If you have plenty of storage: write a MapReduce job to periodically regenerate (or append to) a new table for each property, sorting by it in the row-key. You don't need multiple tables; just use a prefix in your rowkeys for each case. Or, if you don't want extra tables and won't have a lot of queries, simply write the sorted data to CSV files and store them in HDFS, where they can easily be read by your frontend app.

  • Manually maintain a secondary index: this would not be very tolerant of schema updates and new properties, but would work great for near-real-time results. To do it, you have to update your code to also write to the secondary table, with a good write buffer to help performance while avoiding hot regions. Think about rowkeys of this type: [4B SORT FIELD ID (4 chars)][8B SORT FIELD VALUE][8B timestamp], with just one column storing the rowkey of the main table. To retrieve the data sorted by any of the fields, just perform a SCAN using the SORT FIELD ID as the start row, with the starting sort field value as the pivot for pagination (ignore it to get the first page, then set it to the last one retrieved). That way you'll have the rowkeys of the main table, and you can perform a multiget against it to retrieve the full data (see the sketch after this list). Keep in mind that you'll need a small script to scan the main table and write the data to the index table for the existing rows.

  • Rely on any of the automatic secondary indexing through coprocessors, like the ones you mentioned, although I do not like this option at all.
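A minimal sketch of the manual secondary index option, assuming the HBase 2.x Java client API; the table names (records, records_idx), the column family f and qualifier k are hypothetical, and sort field values are assumed to be non-negative longs so their big-endian bytes sort correctly:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SecondaryIndexSketch {

    // Hypothetical schema: "records" is the main table, "records_idx" the index,
    // "f" the column family, "k" the qualifier holding the main-table rowkey.
    static final TableName MAIN  = TableName.valueOf("records");
    static final TableName INDEX = TableName.valueOf("records_idx");
    static final byte[] CF  = Bytes.toBytes("f");
    static final byte[] KEY = Bytes.toBytes("k");

    // Index rowkey layout from the answer:
    // [4B sort field id][8B sort field value][8B timestamp]
    // Assumes non-negative values; flip the sign bit for signed ordering.
    static byte[] indexKey(int fieldId, long value, long ts) {
        return Bytes.add(Bytes.toBytes(fieldId), Bytes.toBytes(value), Bytes.toBytes(ts));
    }

    // Write one index entry. BufferedMutator batches puts client-side,
    // which is the "good buffer" mentioned above.
    static void writeIndexEntry(Connection conn, int fieldId, long value,
                                long ts, byte[] mainRowKey) throws IOException {
        try (BufferedMutator mutator = conn.getBufferedMutator(INDEX)) {
            mutator.mutate(new Put(indexKey(fieldId, value, ts))
                    .addColumn(CF, KEY, mainRowKey));
        }
    }

    // One page of main-table rows sorted by the given field. Pass null as
    // lastIndexKey for the first page, then the last index rowkey you saw.
    static Result[] page(Connection conn, int fieldId, byte[] lastIndexKey,
                         int pageSize) throws IOException {
        Scan scan = new Scan()
                .withStopRow(Bytes.toBytes(fieldId + 1)) // stop when the 4B prefix changes
                .setLimit(pageSize);
        if (lastIndexKey == null) {
            scan.withStartRow(Bytes.toBytes(fieldId), true); // first page: start of prefix
        } else {
            scan.withStartRow(lastIndexKey, false);          // next page: after last key
        }

        List<Get> gets = new ArrayList<>();
        try (Table index = conn.getTable(INDEX);
             ResultScanner scanner = index.getScanner(scan)) {
            for (Result r : scanner) {
                gets.add(new Get(r.getValue(CF, KEY))); // stored main-table rowkey
            }
        }
        try (Table main = conn.getTable(MAIN)) {
            return main.get(gets); // multiget the full rows
        }
    }
}
```

In practice you would keep one BufferedMutator open for the lifetime of your writer rather than per put, and since every entry for a given sort field lands in the same key range, consider salting that prefix if one field takes heavy write traffic (the hot-region concern above).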

You have mostly enumerated the options. HBase natively does not support secondary indexes, as you are aware. In addition to hindex you may consider Phoenix

https://github.com/forcedotcom/phoenix

(from Salesforce), which in addition to secondary indexes has a JDBC driver and SQL support.
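For illustration, a rough sketch of what the query side could look like through Phoenix's JDBC driver; "localhost" stands in for your ZooKeeper quorum, and the RECORDS table with its ID / MY_PROP columns is a made-up example schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixSortExample {
    public static void main(String[] args) throws Exception {
        // Phoenix JDBC URLs take the form jdbc:phoenix:<zookeeper quorum>.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             PreparedStatement ps = conn.prepareStatement(
                     // Sorting and paging become plain SQL; a Phoenix secondary
                     // index on MY_PROP lets this avoid a full table scan.
                     "SELECT ID, MY_PROP FROM RECORDS ORDER BY MY_PROP LIMIT 20")) {
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("ID") + " " + rs.getLong("MY_PROP"));
                }
            }
        }
    }
}
```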
