
How can I limit an HBase scan to only the relevant (unfiltered) regions for a MapReduce job?

I am running a MapReduce job to export data from HBase to HDFS. Multiple filters are applied to the scan.

It is not possible to limit the scan by row key, as the row key does not contain the required information.

When the MR job runs, YARN creates a mapper for each region in HBase. Some of those regions contain only filtered-out data, so their mappers receive nothing to read and are terminated after a timeout. The volume of data to be extracted is significantly less than the total amount of data, so the job eventually fails because of the large number of mappers being terminated.

Answers I am not looking for:

  • Implementing "manual" filtering within the mapper.
  • Increasing timeout interval.

What I am looking for is one of these:

  • A link to an article about how this problem is solved.

  • An efficient solution or an idea (not necessarily with code) that does not involve running the full HBase table through mappers, or that at least (let's be real) reduces the compute load within the mappers.

  • A confirmation that there is no efficient way of doing this, as I've spent a fair amount of time looking for this.

I believe that the code sample is not necessary as the person who understands HBase will know what I am asking for.

Thanks in advance.

To solve this problem, I created an MR job.

The mapper classified each row key into one of the categories and picked the first and last row key for each category (this works because everything is sorted within a region). To pick the last element, I kept a single object per category and overwrote it with each row key that reached the mapper. In the cleanup phase I then wrote both values to the context (classifier_name as the key and row_key as the value).
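The mapper-side bookkeeping described above can be sketched in plain Java, without any Hadoop or HBase types. This is only an illustration under assumptions: the class name `RegionBoundsTracker`, the `accept`/`flush` methods (standing in for `map()` and `cleanup()`), the use of `String` row keys, and the classifier function are all hypothetical and not taken from the original job.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the mapper: tracks the first and last row key
// seen per category, relying on rows arriving in sorted order within a region.
final class RegionBoundsTracker {
    // category -> {firstRowKey, lastRowKey} seen so far
    private final Map<String, String[]> bounds = new LinkedHashMap<>();
    // Assumed classifier: maps a row key to a category name.
    private final Function<String, String> classifier;

    RegionBoundsTracker(Function<String, String> classifier) {
        this.classifier = classifier;
    }

    // Plays the role of map(): called once per row, in row-key order.
    void accept(String rowKey) {
        String category = classifier.apply(rowKey);
        String[] b = bounds.get(category);
        if (b == null) {
            // First row of this category: it is both first and last so far.
            bounds.put(category, new String[] {rowKey, rowKey});
        } else {
            // Later rows only overwrite the "last" slot.
            b[1] = rowKey;
        }
    }

    // Plays the role of cleanup(): emits two (category, rowKey) pairs per category.
    List<String[]> flush() {
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, String[]> e : bounds.entrySet()) {
            out.add(new String[] {e.getKey(), e.getValue()[0]});
            out.add(new String[] {e.getKey(), e.getValue()[1]});
        }
        return out;
    }
}
```

Because nothing is emitted until `flush()`, each mapper writes at most two records per category, no matter how many rows its region holds.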

The mapper output was small (number of categories * 2 records per mapper), so I set the number of reducers to 1 and wrote some basic logic that keeps a low_row/high_row object per classifier and updates it on the fly, so I did not have to sort anything at the end. The final output was of the form:
classifier_name, start_rowKey, end_rowKey
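A minimal sketch of that reducer logic, again in plain Java rather than against the Hadoop API. The names `RowRangeReducer`, `Range`, `accept` and `format` are hypothetical. The sketch also assumes row keys are ASCII strings, so that `String.compareTo` agrees with HBase's byte-wise row ordering; real row keys are byte arrays and would need an unsigned lexicographic comparison instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the single reducer: keeps a running low/high
// row key per classifier and updates it as pairs arrive, with no final sort.
final class RowRangeReducer {
    static final class Range {
        String low;
        String high;

        Range(String first) {
            low = first;
            high = first;
        }

        void update(String rowKey) {
            // Assumes ASCII keys, so String order matches byte order.
            if (rowKey.compareTo(low) < 0) low = rowKey;
            if (rowKey.compareTo(high) > 0) high = rowKey;
        }
    }

    private final Map<String, Range> ranges = new TreeMap<>();

    // Called once per (classifier_name, row_key) pair, in any order.
    void accept(String classifier, String rowKey) {
        Range r = ranges.get(classifier);
        if (r == null) {
            ranges.put(classifier, new Range(rowKey));
        } else {
            r.update(rowKey);
        }
    }

    // One line per classifier: "classifier_name, start_rowKey, end_rowKey".
    List<String> format() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Range> e : ranges.entrySet()) {
            lines.add(e.getKey() + ", " + e.getValue().low + ", " + e.getValue().high);
        }
        return lines;
    }
}
```

Since every update is a pair of comparisons, the reducer holds only one small object per classifier in memory.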

I was then able to use these values to limit my scan.
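To illustrate why a start/end row key pair helps, the sketch below checks which regions overlap a given range. As far as I understand, HBase's table input format only creates splits for regions that intersect the scan's start and stop rows, so regions outside the range should get no mapper at all. The code is plain Java with hypothetical names (`RegionPruningSketch`, `Region`, `regionsToScan`) and ASCII `String` keys; it does not call the HBase API. Note that an HBase stop row is exclusive by default, so end_rowKey has to be treated as inclusive (for example with an inclusive stop row on `Scan`, where the HBase version supports it).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of region pruning: a region covers [startKey, endKey),
// where an empty startKey or endKey means "unbounded" on that side.
final class RegionPruningSketch {
    static final class Region {
        final String name;
        final String startKey;
        final String endKey;

        Region(String name, String startKey, String endKey) {
            this.name = name;
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    // True if the inclusive range [low, high] intersects the region.
    static boolean overlaps(Region r, String low, String high) {
        boolean rangeStartsBeforeRegionEnds =
                r.endKey.isEmpty() || low.compareTo(r.endKey) < 0;
        boolean rangeEndsAtOrAfterRegionStart = high.compareTo(r.startKey) >= 0;
        return rangeStartsBeforeRegionEnds && rangeEndsAtOrAfterRegionStart;
    }

    // Names of the regions a scan limited to [low, high] would touch.
    static List<String> regionsToScan(List<Region> regions, String low, String high) {
        List<String> hit = new ArrayList<>();
        for (Region r : regions) {
            if (overlaps(r, low, high)) {
                hit.add(r.name);
            }
        }
        return hit;
    }
}
```

With one range per classifier, the export can run one bounded scan per range rather than a single scan over the whole table.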

Hope that will help someone :)

