How to skip rows in HBase Scan?

Original post: 2014-01-14 13:02:59 (tags: hadoop, hbase)

I am implementing simple pagination: go to page 1, page 2, page 3, and so on.

In the HBase Book I read that there is a PageFilter whose constructor takes one parameter indicating the number of rows to return. The question is how to go directly to, for example, page 5, skipping pageSize*currentPageNumber rows. The example given in the HBase Book seems to be sequential pagination, i.e. you cannot go to page 5 directly.

Is there a way to skip rows in HBase?

Thanks in advance.

1 answer

The PageFilter doesn't provide any offset functionality. It works just like a LIMIT clause, stopping the scan operation once you have enough data.
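To make the "limit but no offset" point concrete, here is a minimal sketch in Python. It does not use the HBase client at all: `SortedTable`, `scan`, and `page_by_offset` are hypothetical names for an in-memory stand-in that keeps rows sorted by key, as HBase does. With only a limit available, emulating an offset means reading every row before the page and discarding it.

```python
from bisect import bisect_left


class SortedTable:
    """Hypothetical in-memory stand-in for an HBase table (rows sorted by key)."""

    def __init__(self, rows):
        self.keys = sorted(rows)
        self.rows = dict(rows)
        self.rows_read = 0  # counts every row the scan had to read

    def scan(self, start_row="", limit=None):
        """Yield (key, value) from start_row onward, stopping after `limit` rows."""
        i = bisect_left(self.keys, start_row)  # seek directly to the start key
        end = len(self.keys) if limit is None else min(len(self.keys), i + limit)
        for key in self.keys[i:end]:
            self.rows_read += 1
            yield key, self.rows[key]


def page_by_offset(table, page_number, page_size):
    """Emulate an offset with a limit only: read up to the page, keep the tail."""
    wanted = page_number * page_size
    rows = list(table.scan(limit=wanted))
    return rows[(page_number - 1) * page_size:]


table = SortedTable({f"row-{i:04d}": i for i in range(1000)})
page5 = page_by_offset(table, page_number=5, page_size=20)
print(page5[0][0], table.rows_read)  # row-0080 100
```

Page 5 holds 20 rows, but 100 rows were read to produce it, and that cost grows linearly with the page number.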

It's important to note that HBase doesn't know how many rows a table has; you have to scan the whole table to get that count. This alone, among other things, makes it impossible to paginate the data by offset (you don't know the total page count or the offset of each row). Don't see it as a drawback, because not having to track this has a massive impact when you write tons of data.

Having said that, pagination over millions (or billions) of rows doesn't make sense. You should design your tables so that you can always provide a starting point (rowkey), and your scan operation can start reading from there. You don't need to know the whole row key: both start and stop rows can be just a prefix (e.g. if your data is naturally sorted by an 8-byte timestamp, you can use it to fast-forward to previous hours, days, months...).
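The prefix idea can be sketched as follows, again with an in-memory sorted list rather than a real table. The key layout (8-byte big-endian timestamp followed by an event id), the `BASE` value, and the helper names `ts_prefix` and `scan_range` are assumptions made for illustration. Big-endian encoding is what makes byte order agree with time order.

```python
import struct
from bisect import bisect_left


def ts_prefix(epoch_seconds):
    """Encode a timestamp as 8 big-endian bytes so byte order matches time order."""
    return struct.pack(">Q", epoch_seconds)


BASE = 1_389_700_800  # arbitrary base timestamp in epoch seconds (assumption)

# Hypothetical rows: key = 8-byte timestamp + event id, one event every 10 minutes.
rows = {ts_prefix(BASE + 600 * i) + b"-evt%03d" % i: i for i in range(36)}
keys = sorted(rows)


def scan_range(start_row, stop_row):
    """Return keys in [start_row, stop_row), seeking directly to start_row."""
    lo = bisect_left(keys, start_row)
    hi = bisect_left(keys, stop_row)
    return keys[lo:hi]


# Jump straight to the third hour: only the 8-byte prefix is needed, not a full key.
third_hour = scan_range(ts_prefix(BASE + 2 * 3600), ts_prefix(BASE + 3 * 3600))
print(len(third_hour))  # 6
```

Nothing before the third hour is read: the scan seeks to the first key at or after the start prefix and stops at the stop prefix.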

If you cannot provide any starting point (even partially), a very simple solution that could work for you is to retrieve the records in batches (e.g. batches of 1000 items, which could be enough for 50 pages that can easily be handled client-side). Then, when you reach the last page of the batch, use the rowkey of the last item as the starting point for the next scan operation, which should retrieve another batch of 1000 rows, and so on. The only drawback is that going straight to higher pages would be costly, because you need to load the previous batches first.
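A rough sketch of that batch approach, under the same in-memory assumptions as before: `KEYS`, `scan`, and `BatchPager` are hypothetical, and the page size of 20 is simply what makes 1000 rows cover 50 pages. The model treats the start row as inclusive, so the next scan starts at the last key with a zero character appended, which is the smallest key strictly after it.

```python
from bisect import bisect_left

KEYS = sorted(f"user-{i:05d}" for i in range(2500))  # hypothetical row keys


def scan(start_row, limit):
    """Return up to `limit` keys >= start_row (start row is inclusive)."""
    i = bisect_left(KEYS, start_row)
    return KEYS[i:i + limit]


class BatchPager:
    """Serve pages client-side from batches; fetch the next batch by last row key."""

    def __init__(self, batch_size=1000, page_size=20):
        self.batch_size = batch_size
        self.page_size = page_size
        self.rows = []        # everything fetched so far
        self.next_start = ""  # start key for the next scan
        self.exhausted = False
        self.scans = 0

    def _fetch_batch(self):
        batch = scan(self.next_start, self.batch_size)
        self.scans += 1
        if len(batch) < self.batch_size:
            self.exhausted = True
        if batch:
            self.rows.extend(batch)
            # Smallest key strictly after the last one, so that row is not repeated.
            self.next_start = batch[-1] + "\x00"

    def page(self, number):
        """Return page `number` (1-based), loading earlier batches as needed."""
        end = number * self.page_size
        while len(self.rows) < end and not self.exhausted:
            self._fetch_batch()
        return self.rows[end - self.page_size:end]


pager = BatchPager()
first = pager.page(1)   # one scan of 1000 rows covers pages 1..50
later = pager.page(51)  # needs the second batch
print(first[0], later[0], pager.scans)  # user-00000 user-01000 2
```

This sketch keeps every fetched batch in memory for simplicity; a real client might hold only the current batch plus the start key of each batch it has already seen.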