As far as I understand, when using an HBase table as the source for a MapReduce job, we have to define the value for the scan. Let's say we set it to 500; does this mean that each mapper is only given 500 rows from the HBase table? Is there any problem if we set it to a very high value?
If the scan size is small, don't we have the same problem as having small files in MapReduce?
Here's the sample code from the HBase Book on how to run a MapReduce job reading from an HBase table.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
...
TableMapReduceUtil.initTableMapperJob(
tableName, // input HBase table name
scan, // Scan instance to control CF and attribute selection
MyMapper.class, // mapper
null, // mapper output key
null, // mapper output value
job);
job.setOutputFormatClass(NullOutputFormat.class); // because we aren't emitting anything from mapper
boolean b = job.waitForCompletion(true);
if (!b) {
throw new IOException("error with job!");
}
When you say "value for the scan", that's not a real thing. You either mean scan.setCaching(), scan.setBatch(), or scan.setMaxResultSize().
setCaching is used to tell the server how many rows to load before returning the result to the client.
setBatch is used to limit the number of columns returned in each call if you have a very wide table.
setMaxResultSize is used to limit the size (in bytes) of the results returned to the client.
Typically you don't set MaxResultSize in a MapReduce job, so you will see all of the data.
Reference for the above information is here.
The mapper code that you write is given the data row by row. The mapper runtime, however, reads the records in chunks of the caching size (i.e. 500 rows at a time in your case).
If the scan size is too small, execution becomes very inefficient (lots of I/O calls).
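To make the two points above concrete, here is a rough plain-Java sketch. It is not HBase code: BufferedScanner, runMapper and the row counts are hypothetical stand-ins, and the model assumes one round trip per batch of rows, ignoring the extra call a real scanner may make to detect the end of the data. It only illustrates that the caching value changes how many fetches happen, not how many rows the mapper sees.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation only; none of these names come from the HBase API.
class CachingSketch {

    /** Simulated scanner: fetches up to `caching` rows per round trip, hands them out one by one. */
    static final class BufferedScanner {
        private final int totalRows;              // rows available to this (simulated) map task
        private final int caching;                // rows fetched per simulated round trip
        private final List<Integer> buffer = new ArrayList<>();
        private int nextRowToFetch = 0;
        private int bufferPos = 0;
        int roundTrips = 0;

        BufferedScanner(int totalRows, int caching) {
            if (caching < 1) {
                throw new IllegalArgumentException("caching must be >= 1");
            }
            this.totalRows = totalRows;
            this.caching = caching;
        }

        /** Returns the next row id, or -1 once all rows have been delivered. */
        int next() {
            if (bufferPos == buffer.size()) {     // local buffer drained: go back to the "server"
                if (nextRowToFetch == totalRows) {
                    return -1;
                }
                buffer.clear();
                bufferPos = 0;
                int end = Math.min(totalRows, nextRowToFetch + caching);
                while (nextRowToFetch < end) {
                    buffer.add(nextRowToFetch++);
                }
                roundTrips++;
            }
            return buffer.get(bufferPos++);
        }
    }

    /** Runs a simulated map task; returns {number of map() calls, number of round trips}. */
    static int[] runMapper(int totalRows, int caching) {
        BufferedScanner scanner = new BufferedScanner(totalRows, caching);
        int mapCalls = 0;
        while (scanner.next() != -1) {
            mapCalls++;                           // stands in for one map() invocation per row
        }
        return new int[] {mapCalls, scanner.roundTrips};
    }

    public static void main(String[] args) {
        for (int caching : new int[] {1, 500, 10_000}) {
            int[] r = runMapper(100_000, caching);
            System.out.printf("caching=%d -> map() calls=%d, round trips=%d%n",
                    caching, r[0], r[1]);
        }
    }
}
```

In this model the number of map() calls always equals the number of rows, while the number of round trips shrinks as the caching value grows. The sketch does not model the memory needed to hold a larger buffer, which is the cost that a very high value trades against fewer calls.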