
How can I limit the scan of HBase to only relevant (Unfiltered) regions for the MapReduce job

Original 2019-08-08 16:37:43 7 1 java/ mapreduce/ hbase

I am running a mapreduce job to export data from HBase to HDFS. There are multiple filters being applied to the scan.

It is not possible to limit the scan by the row key as it does not contain required information.

When the MR job runs, one mapper is created for each region in HBase. Some of those regions contain only data that is filtered out, so their mappers receive nothing to read and are terminated after a period of time. The volume of data to be extracted is significantly less than the total amount of data, so the job eventually fails because of the large number of terminated mappers.

The answers I am not looking for:

Implementing "manual" filtering within the mapper.
Increasing timeout interval.

What I am looking for is one of these:

A link to an article about how this problem is solved.
An efficient solution or idea (not necessarily with code) that does not involve running the full HBase table through mappers, or that at least (let's be real) reduces the compute load within the mappers.
A confirmation that there is no efficient way of doing this, as I've spent a fair amount of time looking for this.

I believe a code sample is not necessary, as anyone who understands HBase will know what I am asking for.

Thanks in advance.

1 answer

To solve this problem, I created an MR job.

The mapper classified each row key into one of the categories and picked the first and last row key for each category (this works because everything is sorted within a region). To pick the last element, I kept a single object and overwrote it with each value that arrived at the mapper. In the cleanup phase I then wrote both values to the context (classifier_name as the key and row_key as the value).
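The per-mapper bookkeeping described above could look roughly like the following plain-Java sketch. It is not the original job: it uses no Hadoop or HBase types, the class `RowKeyRangeTracker` and its `accept`/`emit` methods are hypothetical stand-ins for `map()` and `cleanup()`, and the classifier is passed in as a function because the post does not say how rows were classified.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for the mapper; row keys are plain strings here. */
final class RowKeyRangeTracker {

    /** First and last row key seen for one category within a single mapper. */
    static final class Range {
        final String first;
        String last;

        Range(String rowKey) {
            this.first = rowKey;
            this.last = rowKey;
        }
    }

    private final Function<String, String> classifier; // assumption: user-supplied
    private final Map<String, Range> ranges = new LinkedHashMap<>();

    RowKeyRangeTracker(Function<String, String> classifier) {
        this.classifier = classifier;
    }

    /** Plays the role of map(): called once per row, in the sorted order of the region. */
    void accept(String rowKey) {
        String category = classifier.apply(rowKey);
        Range range = ranges.get(category);
        if (range == null) {
            // First row of this category: it is both the first and (so far) the last.
            ranges.put(category, new Range(rowKey));
        } else {
            // Keep overwriting; because input is sorted, the final write is the last key.
            range.last = rowKey;
        }
    }

    /** Plays the role of cleanup(): two "category<TAB>rowKey" records per category. */
    List<String> emit() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Range> entry : ranges.entrySet()) {
            out.add(entry.getKey() + "\t" + entry.getValue().first);
            out.add(entry.getKey() + "\t" + entry.getValue().last);
        }
        return out;
    }
}
```

Because only two keys per category are held in memory and nothing is written until the end, each mapper emits at most (number of categories * 2) records, however many rows its region holds.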

The mapper outputs were light (number of categories * 2 records per mapper), so I set the number of reducers to 1 and wrote some basic logic that created an object with low_row/high_row and updated it on the fly, so I did not have to sort anything at the end. The final output was of the form:
classifier_name, start_rowKey, end_rowKey
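A minimal sketch of that single-reducer merge, again in plain Java rather than the Hadoop API. It assumes each incoming record is a `classifier_name<TAB>row_key` string (a format chosen for illustration) and that row keys are ASCII, so that `String.compareTo` orders them the same way HBase orders the underlying bytes. The class name `RowKeyRangeReducer` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the single reducer. */
final class RowKeyRangeReducer {

    /** Merges "classifier<TAB>rowKey" records into "classifier, start, end" lines. */
    static List<String> reduce(List<String> mapperRecords) {
        // bounds[0] is the lowest row key seen so far, bounds[1] the highest.
        Map<String, String[]> boundsByClassifier = new TreeMap<>();
        for (String record : mapperRecords) {
            int tab = record.indexOf('\t');
            String name = record.substring(0, tab);
            String rowKey = record.substring(tab + 1);
            String[] bounds = boundsByClassifier.get(name);
            if (bounds == null) {
                boundsByClassifier.put(name, new String[] {rowKey, rowKey});
            } else {
                // Update low/high on the fly, so no final sort is needed.
                if (rowKey.compareTo(bounds[0]) < 0) {
                    bounds[0] = rowKey;
                }
                if (rowKey.compareTo(bounds[1]) > 0) {
                    bounds[1] = rowKey;
                }
            }
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String[]> entry : boundsByClassifier.entrySet()) {
            lines.add(entry.getKey() + ", " + entry.getValue()[0] + ", " + entry.getValue()[1]);
        }
        return lines;
    }
}
```

The records can arrive in any order, since each one only widens the low/high pair for its classifier.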

I was then able to use these values to limit my scan.
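Why a row range helps: as far as I know, HBase's table input format only creates splits for regions that overlap the scan's start and stop rows, so regions outside the range never get a mapper. The plain-Java sketch below illustrates that overlap test. It is not HBase code; `RegionPruningSketch`, `Region`, and `prune` are hypothetical names, keys are assumed to be ASCII strings, and the high row is treated as inclusive because end_rowKey is a real row that must still be read.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of which regions a [lowRow, highRow] scan touches. */
final class RegionPruningSketch {

    /** A region covers [startKey, endKey); an empty string means unbounded. */
    static final class Region {
        final String startKey;
        final String endKey;

        Region(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** True if the region holds any key in the inclusive range [lowRow, highRow]. */
    static boolean overlaps(Region region, String lowRow, String highRow) {
        // The region must start at or before the last wanted row...
        boolean startsInTime = region.startKey.compareTo(highRow) <= 0;
        // ...and must end (exclusively) after the first wanted row.
        boolean endsLateEnough = region.endKey.isEmpty() || region.endKey.compareTo(lowRow) > 0;
        return startsInTime && endsLateEnough;
    }

    /** Keeps only the regions that would need a mapper for the given range. */
    static List<Region> prune(List<Region> regions, String lowRow, String highRow) {
        List<Region> kept = new ArrayList<>();
        for (Region region : regions) {
            if (overlaps(region, lowRow, highRow)) {
                kept.add(region);
            }
        }
        return kept;
    }
}
```

In a real job the same bounds would go onto the `Scan` itself. Since a scan's stop row is exclusive by default, the end_rowKey has to be made inclusive, for example with `Scan.withStopRow(stopRow, true)` in client versions that provide it.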

Hope that will help someone :)

mapreduce, hbase and scan

How can I run a mapreduce job remotely

How does HBase mapreduce job communicate with server? (newbie question)

How can I limit Zebra TC520K touch computer scanner to only scan 9 digit barcodes?

Hbase scan vs Mapreduce for on the fly computation

How can I limit Spring component scan to only files in my war?

Hbase mapreduce job: all column values are null

HBase bulk delete using MapReduce job

How can I convert only the relevant parts of a Json string to a Set?

How to limit a Hadoop MapReduce job to a certain number of nodes?





solution1 1 2019-08-14 16:29:15
