
HBase chain MapReduce job with broadcasting smaller tables to all Mappers

I am trying to write a chained MapReduce job on data stored in HBase tables and need some help with the concept. I am not expecting anyone to provide code, but pseudo code based on HBase's Java API would be nice.

In a nutshell, here is what I am trying to do:

MapReduce Job 1: Read data from two tables with no common row keys and build a summary from them in the reducer. The reducer's output is a Java object containing the summary, serialized to a byte array. I store this object in a temporary table in HBase.
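Assuming the summary can be modeled as a plain serializable Java object (the class name `JobSummary` and its fields are placeholders, not from the question), the reducer could turn it into a byte array suitable for an HBase cell value and read it back later. A minimal round-trip sketch:

```java
import java.io.*;

// Hypothetical summary produced by Job 1's reducer; the class name and
// fields are illustrative only.
class JobSummary implements Serializable {
    private static final long serialVersionUID = 1L;

    final long rowCount;
    final double total;

    JobSummary(long rowCount, double total) {
        this.rowCount = rowCount;
        this.total = total;
    }

    // Serialize to a byte[] suitable for storing as an HBase cell value,
    // e.g. put.addColumn(CF, QUALIFIER, JobSummary.toBytes(summary)).
    static byte[] toBytes(JobSummary s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(s);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserialize the cell value read back in Job 2.
    static JobSummary fromBytes(byte[] bytes) {
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (JobSummary) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The HBase `Put`/`Get` calls around this are omitted; only the serialization round-trip is shown.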

MapReduce Job 2: This is where I am having problems. I now need to read this summary object so that it is available in every mapper: when I read data from a third (different) table, I want to use the summary to perform further calculations on the rows coming from that table.
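Since the summary is small, one alternative to the distributed cache is to fetch it once in the Job 2 driver and ship the serialized bytes to every mapper through the job `Configuration`. Configuration values are strings, so the bytes need an encoding such as Base64. The Hadoop calls (`conf.set`, `context.getConfiguration().get`) are shown only in comments so the sketch stays self-contained; the key name `summary.bytes` is an arbitrary choice for this example:

```java
import java.util.Base64;

// Sketch: shipping the small serialized summary to every mapper through the
// job Configuration instead of the distributed cache.
class SummaryConfigCodec {

    static final String CONF_KEY = "summary.bytes";

    // In the Job 2 driver, after reading the summary cell from the temp table:
    //   conf.set(SummaryConfigCodec.CONF_KEY, SummaryConfigCodec.encode(cellValue));
    static String encode(byte[] summaryBytes) {
        return Base64.getEncoder().encodeToString(summaryBytes);
    }

    // In each mapper's setup():
    //   byte[] raw = SummaryConfigCodec.decode(
    //       context.getConfiguration().get(SummaryConfigCodec.CONF_KEY));
    static byte[] decode(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }
}
```

This avoids any HBase read from the mappers at all, at the cost of carrying the payload in the job configuration, which is only reasonable while the summary stays small.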

I read about the distributed cache and tried to implement it, but that did not work out. I can provide more details in the form of edits if needed; I don't want to clutter this question with details that might turn out to be irrelevant.

Well, this might sound stupid, but if the table we query is really small, we can probably get away with reading the values using the HBase Java API (even inside a MapReduce job) and storing them in static variables. That way, the values are read only once per mapper JVM, which is not much of an overhead.
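The read-once idea above can be sketched as a lazily initialized static holder. The `Supplier` stands in for the actual HBase read (e.g. a `Get` against the temporary table followed by deserialization), which is not shown; everything else is plain Java:

```java
import java.util.function.Supplier;

// Load-once cache shared by all map() calls in a mapper JVM. The loader
// stands in for the real HBase read so the pattern is self-contained.
class SummaryCache {

    private static volatile Object cached;

    // Returns the cached summary, invoking the loader at most once per JVM
    // (double-checked locking on the volatile field).
    static Object getOrLoad(Supplier<Object> loader) {
        Object local = cached;
        if (local == null) {
            synchronized (SummaryCache.class) {
                local = cached;
                if (local == null) {
                    local = loader.get();   // one HBase round-trip per mapper JVM
                    cached = local;
                }
            }
        }
        return local;
    }
}
```

In the mapper you would call `SummaryCache.getOrLoad(...)` from `setup()`, so every `map()` invocation sees the same object without touching HBase again. One caveat: with JVM reuse enabled, the static value survives across tasks, which is fine here because the summary does not change during Job 2.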
