How to use an HBase secondary index table as an input in a MapReduce job?

I am new to HBase. I have a main table with rowkey = id-YYYYMMDD, and a secondary index table with rowkey = YYYYMMDD-id and a column holding the corresponding rowkey in the main table. I will have about 1 million ids in the near future, and I will need to create a MapReduce job to summarize the ids for a given date (YYYYMMDD).

How do I pass the secondary index table to the MapReduce job so that the corresponding "get(rowkey)" calls are run against the main table to fetch the columns and summarize the data?

You have 2 options:

  1. First, run a scan on the index table. The scan has a startRow and a stopRow (e.g. '20190401' and '20190402'), so it reads one contiguous range of the key space and collects the main-table row keys stored there. Time complexity is O(M), where M is the number of index rows for the given date. Then request the data from the main table by those row keys using Get.
  2. Since the date is also part of the main table key, you can skip the index and run a MapReduce scan over the main table with a row key filter. Because the date is the suffix of the key rather than the prefix, the filter has to look at every row, so this runs in O(N/P), where N is the total number of rows in the table and P is the parallelism of your cluster.
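
The key handling behind both options can be sketched in plain Java. This is only an illustration under stated assumptions: the class `IndexKeys` and its methods are hypothetical names, the index key is assumed to be exactly `YYYYMMDD-id`, and no HBase classes are used so the snippet stays self-contained.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HBase: builds the row-key strings
// that the two options above would hand to a scan or a filter.
final class IndexKeys {
    // yyyyMMdd, e.g. "20190401"
    private static final DateTimeFormatter FMT = DateTimeFormatter.BASIC_ISO_DATE;

    // Option 1: inclusive start of the index range for one date.
    static String startRow(String date) {
        return date;
    }

    // Option 1: exclusive stop of the index range, i.e. the next day.
    static String stopRow(String date) {
        return LocalDate.parse(date, FMT).plusDays(1).format(FMT);
    }

    // Turns an index key "YYYYMMDD-id" into the main-table key "id-YYYYMMDD".
    // Assumes the date is always the first 8 characters of the index key.
    static String mainRowKey(String indexRowKey) {
        if (indexRowKey.length() < 10 || indexRowKey.charAt(8) != '-') {
            throw new IllegalArgumentException("unexpected index key: " + indexRowKey);
        }
        String date = indexRowKey.substring(0, 8);
        String id = indexRowKey.substring(9);
        return id + "-" + date;
    }

    // Option 2: pattern for main-table keys that end with "-YYYYMMDD".
    static Pattern mainKeyPattern(String date) {
        return Pattern.compile(".*-" + Pattern.quote(date) + "$");
    }
}
```

In a real job, the two range strings would typically be converted with `Bytes.toBytes` and set on the `Scan` that is passed to `TableMapReduceUtil.initTableMapperJob` for the index table, with the mapper issuing a `Get` per main-table key. For option 2, a regular expression like the one above could be given to a `RowFilter` with a `RegexStringComparator`. If the index column already stores the main-table row key, as described in the question, `mainRowKey` is not needed and the stored value can be used directly.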
