How to scan HBase rows efficiently

Question

I need to write a MapReduce job that gets all rows in a given date range (say, the last month). It would have been a cakewalk had my row key started with the date, but my frequent HBase queries are on the leading values of the key.

My row key has exactly the form A|B|C|20120121|D, where the combination of A/B/C along with the date (in YearMonthDay format) makes a unique row ID.

My HBase tables could have up to a few million rows. Should my Mapper read the whole table and check whether each row falls in the given date range, or can a Scan/Filter help handle this situation?

Could someone suggest a way (or a snippet of code) to handle this situation effectively?

Thanks -Panks

Answer 1

A RowFilter with a regex comparator would work, but it would not be the most efficient solution. Alternatively, you can try using secondary indexes.

Another solution is to try the FuzzyRowFilter. A FuzzyRowFilter uses a kind of fast-forwarding, skipping many rows during the scan, so it will be faster than a RowFilter scan. You can read more about it here.
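
To make the key/mask idea behind FuzzyRowFilter concrete, here is a minimal sketch in plain Java with no HBase dependency. It assumes, purely for illustration, that A, B, C and D are each exactly one byte wide, so the date always sits at the same offset; a fuzzy mask only works with fixed positions like this. The class and method names are hypothetical, and the matches method is a simplified local stand-in for the filter's rule, not HBase code. In the mask, 0 marks a byte that must match the template and 1 marks a byte that may be anything.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: fuzzy key/mask pair for row keys shaped like A|B|C|yyyyMMdd|D. */
class FuzzyKeySketch {

    /** Template for one calendar month, e.g. yearMonth = "201201". '?' marks wildcard bytes. */
    static byte[] template(String yearMonth) {
        // Assumption: A, B, C and D are single bytes, so the key layout is fixed-width.
        return ("?|?|?|" + yearMonth + "??|?").getBytes(StandardCharsets.US_ASCII);
    }

    /** Mask convention: 0 = byte must equal the template byte, 1 = any byte is accepted. */
    static byte[] mask(byte[] template) {
        byte[] mask = new byte[template.length];
        for (int i = 0; i < template.length; i++) {
            mask[i] = (byte) (template[i] == '?' ? 1 : 0);
        }
        return mask;
    }

    /** Simplified local check of a row key against a template/mask pair. */
    static boolean matches(byte[] rowKey, byte[] template, byte[] mask) {
        if (rowKey.length != template.length) {
            return false;
        }
        for (int i = 0; i < rowKey.length; i++) {
            if (mask[i] == 0 && rowKey[i] != template[i]) {
                return false;
            }
        }
        return true;
    }

    /** Convenience wrapper: does this row key carry a date in the given month? */
    static boolean matchesMonth(String rowKey, String yearMonth) {
        byte[] t = template(yearMonth);
        return matches(rowKey.getBytes(StandardCharsets.US_ASCII), t, mask(t));
    }

    public static void main(String[] args) {
        System.out.println(matchesMonth("A|B|C|20120121|D", "201201"));
    }
}
```

With HBase on the classpath, the template and mask would be handed to a FuzzyRowFilter as a key/mask pair and set on the Scan. A range that spans two calendar months would need one pair per month.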

Alternatively, Bloom filters might also help, depending on your schema. If your data is huge, you should do a comparative analysis of secondary indexes and Bloom filters.

Answer 2

You can use a RowFilter with a RegexStringComparator. You'd need to come up with a regex that filters your dates appropriately. This page has an example that includes setting a Filter for a MapReduce scanner.
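
As a rough illustration of what such a regex could look like, the sketch below builds a pattern for keys of the form A|B|C|yyyyMMdd|D whose date falls in one calendar month, and checks it locally with java.util.regex. The class and method names are hypothetical, and it assumes that the fields before the date never contain a '|' character. The pattern is anchored with ^ and $ so it behaves the same under find-style and full-match semantics.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: regex for row keys shaped like A|B|C|yyyyMMdd|D. */
class DateRegexSketch {

    /** Pattern text for keys whose date starts with yearMonth (digits only, e.g. "201201"). */
    static String monthRegex(String yearMonth) {
        // Three pipe-delimited fields, the month prefix, two day digits, then the rest of the key.
        return "^[^|]*\\|[^|]*\\|[^|]*\\|" + yearMonth + "\\d{2}\\|.*$";
    }

    /** Local check of a row key against the pattern, handy before running a full scan. */
    static boolean matchesMonth(String rowKey, String yearMonth) {
        return Pattern.compile(monthRegex(yearMonth)).matcher(rowKey).find();
    }

    public static void main(String[] args) {
        System.out.println(monthRegex("201201"));
        System.out.println(matchesMonth("A|B|C|20120121|D", "201201"));
    }
}
```

The string returned by monthRegex is what would be passed to the RegexStringComparator. A range covering parts of two months would need an alternation of month prefixes, or explicit day ranges. Note that every row key is still read and tested, which is why this approach scans the whole table.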

Answer 3

I have only just started using HBase, but Bloom filters might help.

Answer 4

You can modify the Scan that you send into the Mapper to include a filter. If your date is also the record timestamp, it's easy:

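// minTime is inclusive and maxTime is exclusive, both in epoch milliseconds.
Scan scan = new Scan();
scan.setTimeRange(minTime, maxTime);
// Hand the restricted scan to the mapper job.
TableMapReduceUtil.initTableMapperJob("mytable", scan, MyTableMapper.class,
     OutputKey.class, OutputValue.class, job);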

If the date in your row key is different from the timestamp, you'll have to add a filter to your scan. This filter can operate on a column or on the row key. I think it's going to be messy with just the row key. If you put the date in a column, you can make a FilterList where all conditions must be true and use a CompareOp.GREATER and a CompareOp.LESS. Then use scan.setFilter(filterList) to add your filters to the scan.
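
One detail worth checking before building the FilterList is how the bounds behave. GREATER and LESS are strict comparisons, so both ends of the range are excluded, and dates stored as yyyyMMdd strings sort lexicographically in chronological order. The plain-Java sketch below mimics that logic without any HBase dependency; the class and method names are hypothetical, and it assumes the date column holds an 8-digit yyyyMMdd string.

```java
/** Hypothetical helper: mimics a two-condition, all-must-pass range check on a yyyyMMdd value. */
class DateColumnRangeSketch {

    /** True when lowerExclusive < date < upperExclusive, compared as yyyyMMdd strings. */
    static boolean inRange(String date, String lowerExclusive, String upperExclusive) {
        boolean greater = date.compareTo(lowerExclusive) > 0; // plays the role of CompareOp.GREATER
        boolean less = date.compareTo(upperExclusive) < 0;    // plays the role of CompareOp.LESS
        return greater && less;                               // both conditions must hold
    }

    public static void main(String[] args) {
        // To cover all of January 2012, the bounds are the day before and the day after the month.
        System.out.println(inRange("20120121", "20111231", "20120201"));
    }
}
```

In an actual scan, each comparison would likely be a SingleColumnValueFilter on the date column, combined in a FilterList that requires all filters to pass. If inclusive bounds read more naturally, GREATER_OR_EQUAL and LESS_OR_EQUAL are the alternatives.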

4 answers

Answer 1: score 10, 2012-12-26 09:53:27
Answer 2: score 5, accepted, 2012-01-23 04:57:58
Answer 3: score 0, 2012-01-22 15:22:34
Answer 4: score 0, 2012-01-23 04:50:57
