
How to use an HBase secondary index table as an input in a MapReduce job?

Original post: 2019-04-23 13:18:49. Tags: hadoop, mapreduce, hbase

I am new to HBase. I have a main table with rowkey = id-YYYYMMDD, and a secondary index table with rowkey = YYYYMMDD-id and a column that holds the corresponding rowkey in the main table. I will have about 1 million ids in the near future, and I will need to create a MapReduce job to summarize the ids for a given date (YYYYMMDD).

How do I pass the secondary index table to the MapReduce job so that the corresponding "get(rowkey)" calls are run against the main table to fetch the columns and summarize the data?

1 Answer

You have 2 options:

1. Run a scan on the index table first. The scan takes a startRow and a stopRow (e.g. '20190401' and '20190402'), so it covers one contiguous range of the key space and collects the IDs of the matching main-table rows. The time complexity is O(M), where M is the number of items in the given batch (here, the index rows for that date). Then request the data from the main table by those IDs using Get.
2. Since the date is also part of your main table key, you can instead run a MapReduce scan over the main table with row key filtering. This runs in O(N/P), where N is the total number of rows in the table and P is the parallelism of your cluster.
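To make the difference between the two access patterns concrete, here is a minimal in-memory sketch. It does not use the HBase client API: two sorted `TreeMap`s stand in for the main and index tables, and the class name `IndexScanSketch`, the helper `put`, and the single numeric column are all hypothetical. In a real job the index-table `Scan` would presumably be handed to the job through something like `TableMapReduceUtil.initTableMapperJob`, with the mapper issuing `Get` calls against the main table, but that wiring is an assumption and is not shown here.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory model only: TreeMaps stand in for HBase tables,
// since both keep their rows sorted by row key.
class IndexScanSketch {

    // Main table: rowkey "id-YYYYMMDD" -> one numeric column (hypothetical).
    static final NavigableMap<String, Long> MAIN = new TreeMap<>();
    // Index table: rowkey "YYYYMMDD-id" -> rowkey in the main table.
    static final NavigableMap<String, String> INDEX = new TreeMap<>();

    // Writes a row to the main table and its entry to the index table.
    static void put(String id, String date, long value) {
        String mainKey = id + "-" + date;
        MAIN.put(mainKey, value);
        INDEX.put(date + "-" + id, mainKey);
    }

    // Option 1: range scan over the index in [startRow, stopRow),
    // then one lookup in the main table per index hit. Cost grows with M.
    static long sumViaIndex(String startRow, String stopRow) {
        long sum = 0;
        for (String mainKey : INDEX.subMap(startRow, true, stopRow, false).values()) {
            sum += MAIN.get(mainKey); // models Get(rowkey) on the main table
        }
        return sum;
    }

    // Option 2: scan every row of the main table and keep only keys that
    // end with "-date". Cost grows with N, spread over parallel workers.
    static long sumViaFilteredScan(String date) {
        return MAIN.entrySet().parallelStream()
                .filter(e -> e.getKey().endsWith("-" + date))
                .mapToLong(Map.Entry::getValue)
                .sum();
    }

    public static void main(String[] args) {
        put("a1", "20190401", 10);
        put("a2", "20190401", 5);
        put("a1", "20190402", 7);
        System.out.println(sumViaIndex("20190401", "20190402"));
        System.out.println(sumViaFilteredScan("20190401"));
    }
}
```

For the sample rows both calls return 15. The index path only touches the two rows for that date, while the filtered scan inspects all three rows, which is the O(M) versus O(N/P) trade-off described above.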