
Does it make sense to run an hbase mapreduce job with a million Scans?

I have a dataset in hbase which is large enough that it takes a couple of hours to run a mapreduce job over the entire dataset. I'd like to be able to break the data down using precomputed indexes: once a day, map the entire dataset and build multiple indexes:

  • 1% sample of all users
  • All users who are participating in a particular A/B experiment
  • All users on the nightly prerelease channel.
  • All users with a particular addon (or whatever criterion we're interested in this week)

My thought was to just store a list of row IDs for the relevant records, and then later people can do little mapreduce jobs on just those rows. But a 1% sample is still 1M rows of data, and I'm not sure how to construct a mapreduce job on a list of a million rows.
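For concreteness, here is a minimal sketch (not the poster's actual code) of what that once-a-day indexing pass could look like, assuming a hypothetical "users" table and a 1% sample chosen by hashing the row key. It scans the full table once and writes the matching row keys to a plain text file that the smaller follow-up jobs can read; the table name and sampling rule are assumptions.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BuildSampleIndexJob {

  /** Emits the row key of every record that falls into the 1% sample. */
  public static class SampleMapper extends TableMapper<Text, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
        throws IOException, InterruptedException {
      String id = Bytes.toString(rowKey.get(), rowKey.getOffset(), rowKey.getLength());
      // Hypothetical sampling rule: keep ~1% of rows by hashing the row key.
      if (Math.floorMod(id.hashCode(), 100) == 0) {
        context.write(new Text(id), NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "build-one-percent-index");
    job.setJarByClass(BuildSampleIndexJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);
    scan.setCacheBlocks(false); // don't pollute the block cache during a full scan

    TableMapReduceUtil.initTableMapperJob(
        "users", scan, SampleMapper.class, Text.class, NullWritable.class, job);
    job.setNumReduceTasks(0);
    FileOutputFormat.setOutputPath(job, new Path(args[0])); // e.g. /indexes/one-percent-sample
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With zero reducers and the default TextOutputFormat, the NullWritable values write nothing, so the output is simply one row key per line.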

Does it make any sense to create a table mapper job using initTableMapperJob(List<Scan> scans) if there are going to be a million different Scan objects making up the query? Are there other ways to do this so that I can still farm out the computation and I/O to the hbase cluster efficiently?

Don't do a million scans. If you have a million non-contiguous ids, you could run a map/reduce job over the list of ids using a custom input format so that you divide the list up into a reasonable number of partitions (I would guess 4x the number of your m/r slots, but that number is not based on anything). That would give you a million get operations, which is probably better than a million scans.
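As a rough illustration of that approach, here is a minimal sketch. Instead of a fully custom InputFormat it uses the stock NLineInputFormat to chop a plain text file of row ids (such as the precomputed index sketched above) into splits of roughly 10,000 ids each, and each mapper issues Gets against a hypothetical "users" table; the table name, file layout and split size are all assumptions to tune for your cluster.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GetByIdListJob {

  /** Each map() call receives one row id from the index file and fetches that row with a Get. */
  public static class GetMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Connection connection;
    private Table table;

    @Override
    protected void setup(Context context) throws IOException {
      connection = ConnectionFactory.createConnection(context.getConfiguration());
      table = connection.getTable(TableName.valueOf("users")); // hypothetical table name
    }

    @Override
    protected void map(LongWritable offset, Text rowId, Context context)
        throws IOException, InterruptedException {
      Result result = table.get(new Get(Bytes.toBytes(rowId.toString().trim())));
      if (!result.isEmpty()) {
        // Real per-row analysis would go here; this just counts the rows that were found.
        context.write(new Text("rows"), new IntWritable(1));
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      table.close();
      connection.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "gets-from-precomputed-index");
    job.setJarByClass(GetByIdListJob.class);

    // Split the id list into chunks of ~10k ids per mapper; tune this so you
    // end up with a few splits per available slot.
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.addInputPath(job, new Path(args[0])); // e.g. /indexes/one-percent-sample
    NLineInputFormat.setNumLinesPerSplit(job, 10000);

    job.setMapperClass(GetMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Buffering a few hundred ids inside each map task and calling table.get(List<Get>) instead of one Get per call would reduce the number of round trips further, since the client groups batched Gets by region server.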

If you are lucky enough to have a more reasonable number of contiguous ranges, then scans would be better than straight gets.
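If the ranges do collapse like that, the multi-scan route from the question becomes workable. Below is a minimal sketch, assuming a hypothetical "users" table and a hand-written list of start/stop keys: it builds one Scan per contiguous range, tags each Scan with the table name (required because initTableMapperJob(List<Scan>, ...) runs on MultiTableInputFormat), and hands the list to a TableMapper.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ContiguousRangesJob {

  /** The mapper sees every row covered by any of the scans. */
  public static class RangeMapper extends TableMapper<Text, IntWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
        throws java.io.IOException, InterruptedException {
      // Real per-row analysis goes here.
      context.write(new Text("rows"), new IntWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-contiguous-ranges");
    job.setJarByClass(ContiguousRangesJob.class);

    // One Scan per contiguous range of the precomputed index; a handful of
    // these is far cheaper than a million single-row Scans or Gets.
    String[][] ranges = { {"user0000", "user0100"}, {"user5000", "user5100"} }; // hypothetical ranges
    List<Scan> scans = new ArrayList<>();
    for (String[] range : ranges) {
      Scan scan = new Scan();
      scan.setStartRow(Bytes.toBytes(range[0]));
      scan.setStopRow(Bytes.toBytes(range[1]));
      scan.setCaching(500);
      scan.setCacheBlocks(false); // recommended for MapReduce scans
      scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes("users")); // hypothetical table
      scans.add(scan);
    }

    TableMapReduceUtil.initTableMapperJob(
        scans, RangeMapper.class, Text.class, IntWritable.class, job);
    job.setNumReduceTasks(0);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}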
