
How to use an HBase secondary index table as an input in a MapReduce job?

I am new to HBase. I have a main table with rowkey = id-YYYYMMDD, and a secondary index table with rowkey = YYYYMMDD-id and a column holding the corresponding rowkey in the main table. I will have about 1 million ids in the near future, and I will need to create a MapReduce job to summarize the ids for a given date (YYYYMMDD).

How do I pass the secondary index table to the MapReduce job so that the corresponding "get(rowkey)" calls are run against the main table to fetch the columns and summarize the data?

You have two options:

  1. Run a scan on the index table first. Give the scan a startRow and a stopRow (e.g. '20190401' and '20190402') so that it covers one contiguous range of the key space and collects the main-table rowkeys stored for that date. The time complexity is O(M), where M is the number of index rows for the given date. Then request the data from the main table by those rowkeys using Get.
  2. Since the date is also part of the main table's rowkey, you can instead run a MapReduce scan over the main table with a rowkey filter. The date is the suffix of that key, so the scan has to visit every row; it runs in O(N/P), where N is the total number of rows in the table and P is the parallelism of your cluster.
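To make the difference between the two options concrete, here is a minimal sketch that uses plain Java sorted maps as stand-ins for the two tables; it does not call the HBase client at all. The class name `IndexKeyDemo`, its helper methods, the sample rows, and the idea that "summarize" means summing one numeric value per row are all assumptions made for illustration. It also assumes ASCII rowkeys, so that Java string order matches HBase's byte order. In a real job, the start and stop rows computed here would presumably be set on the `Scan` passed to the job, and the map lookup would be a `Get` against the main table.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory stand-in for the two HBase tables; no HBase client is used here.
final class IndexKeyDemo {
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE; // yyyyMMdd

    // Inclusive start row of the index range for one day, e.g. "20190401".
    static String startRow(String date) {
        return date;
    }

    // Exclusive stop row: the next calendar day, e.g. "20190402".
    static String stopRow(String date) {
        return LocalDate.parse(date, DAY).plusDays(1).format(DAY);
    }

    // Option 1: range-scan the index, then look up each referenced main-table row.
    static long sumViaIndex(NavigableMap<String, String> index,
                            Map<String, Long> main, String date) {
        long total = 0;
        for (String mainKey : index.subMap(startRow(date), true, stopRow(date), false).values()) {
            Long value = main.get(mainKey); // plays the role of Get(rowkey)
            if (value != null) {
                total += value;
            }
        }
        return total;
    }

    // Option 2: visit every main-table row and keep only keys ending in "-<date>".
    static long sumViaFullScan(Map<String, Long> main, String date) {
        String suffix = "-" + date;
        long total = 0;
        for (Map.Entry<String, Long> row : main.entrySet()) {
            if (row.getKey().endsWith(suffix)) {
                total += row.getValue();
            }
        }
        return total;
    }

    // Writes one main-table row (id-YYYYMMDD) and its matching index entry (YYYYMMDD-id).
    static void put(NavigableMap<String, String> index, Map<String, Long> main,
                    String id, String date, long value) {
        main.put(id + "-" + date, value);
        index.put(date + "-" + id, id + "-" + date);
    }

    public static void main(String[] args) {
        NavigableMap<String, String> index = new TreeMap<>();
        Map<String, Long> main = new TreeMap<>();
        put(index, main, "a1", "20190331", 5L);
        put(index, main, "a1", "20190401", 10L);
        put(index, main, "b2", "20190401", 7L);
        put(index, main, "b2", "20190402", 3L);

        System.out.println(sumViaIndex(index, main, "20190401"));  // 17
        System.out.println(sumViaFullScan(main, "20190401"));      // 17
    }
}
```

Both paths return the same total, but the first touches only the index rows for the requested day plus one lookup each, while the second inspects every row of the main table. Which one is faster in practice depends on how M compares with N/P on your cluster.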

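Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you republish this content, please credit this site's URL or the original source address. For any questions, contact yoyou2525@163.com.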

 