
Using an HBase table as MapReduce source

As far as I understand, when using an HBase table as the source for a MapReduce job, we have to define a value for the scan. Let's say we set it to 500; does this mean that each mapper is only given 500 rows from the HBase table? Is there any problem if we set it to a very high value?

If the scan size is small, don't we have the same problem as having small files in MapReduce?

Here's the sample code from the HBase Book on how to run a MapReduce job reading from an HBase table.

Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class);     // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs
...

TableMapReduceUtil.initTableMapperJob(
   tableName,        // input HBase table name
   scan,             // Scan instance to control CF and attribute selection
   MyMapper.class,   // mapper
   null,             // mapper output key
   null,             // mapper output value
   job);
job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}

When you say "value for the scan", that's not a real thing. You probably mean scan.setCaching(), scan.setBatch(), or scan.setMaxResultSize().

  1. setCaching tells the server how many rows to load (per RPC) before returning results to the client
  2. setBatch limits the number of columns returned in each call, which matters if you have a very wide table
  3. setMaxResultSize limits the amount of data (in bytes) returned to the client per call

Typically you don't set the MaxResultSize in a MapReduce job, so you will see all of the data.

Reference for the above information is here.
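For illustration, here's a minimal sketch of where those three settings live on the Scan object. The column family name and the byte limit are made up for this example, not taken from the question:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("cf"));       // hypothetical column family, just for illustration
scan.setCaching(500);                      // rows fetched from the region server per RPC
scan.setBatch(100);                        // max columns per Result, only matters for very wide rows
// scan.setMaxResultSize(2L * 1024 * 1024); // byte cap per RPC; usually left unset for MR jobs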

The mapper code that you write is given the data row by row. The mapper runtime, however, reads the records in chunks of the caching size (i.e. 500 rows at a time in your case).
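For example, a minimal mapper for the job above might look like the sketch below. The MyMapper name comes from the job setup code; the column family "cf" and qualifier "qual" are made-up placeholders:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class MyMapper extends TableMapper<NullWritable, NullWritable> {

    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
            throws IOException, InterruptedException {
        // Called once per row, regardless of the caching value used in the Scan.
        // "cf" and "qual" are placeholder names for this sketch.
        byte[] value = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("qual"));
        // ... process the value; nothing is emitted, matching the NullOutputFormat above
    }
}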

If the caching value is too small, the execution becomes very inefficient (lots of RPC/I/O calls).
