My HBase table looks like this:
key---------value
id1/bla value1
id1/blabla value2
id2/bla value3
id2/blabla value4
....
There are millions of keys that start with id1 and millions of keys that start with id2.
I want to read the data from HBase with MapReduce. Because so many keys start with the same id, one mapper per id isn't good enough; I would prefer about 100 mappers per id.
I want more than one mapper to run over the same scanner result, filtered by id. I read about TableMapReduceUtil and tried the following:
Configuration config = HBaseConfiguration.create();
Job job = Job.getInstance(config, "ExampleSummary"); // the Job(Configuration, String) constructor is deprecated
job.setJarByClass(MySummaryJob.class); // class that contains mapper and reducer
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
TableMapReduceUtil.initTableMapperJob(
    sourceTable,       // input table
    scan,              // Scan instance to control CF and attribute selection
    MyMapper.class,    // mapper class
    Text.class,        // mapper output key
    IntWritable.class, // mapper output value
    job);
With a map function that looks like this (it should iterate over the scanner result):
public static class MyMapper extends TableMapper<Text, IntWritable> {
    private final IntWritable ONE = new IntWritable(1);
    private Text text = new Text();

    @Override
    public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
        text.set("123"); // we can only emit Writables...
        context.write(text, ONE);
    }
}
My questions are:
I'll start with #4 in your list:
The default behavior is to create one mapper per region. Therefore, instead of trying to hack TableInputFormat into creating custom input splits to your specification, you should first consider splitting your data into 100 regions (you will then have 100 mappers, reasonably well balanced).
This approach improves both your read and write performance, as you'll be less vulnerable to hotspotting (assuming that you have more than one or two region servers in your cluster).
The preferred way to go about this is to pre-split your table (i.e., define the splits when the table is created).
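As a rough illustration of what pre-splitting could look like for keys such as id1/..., here is a minimal sketch that only computes the split keys, so it runs without the HBase client. The class name SplitKeySketch and the method splitKeys are hypothetical, and the sketch assumes that the two bytes following the id prefix are roughly uniformly distributed (for example, a hash). With plain-text suffixes like "bla" that assumption does not hold, and you would want to pick split points by sampling your real keys instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: computes split keys only, no HBase dependency.
class SplitKeySketch {

    // Returns `regions` split keys for one id prefix: the prefix itself,
    // followed by prefix + 2 bytes evenly spaced over 0x0000..0xFFFF.
    // ASSUMPTION: the bytes after the prefix are roughly uniform.
    static byte[][] splitKeys(String prefix, int regions) {
        if (regions < 1 || regions > 65536) {
            throw new IllegalArgumentException("regions must be in 1..65536");
        }
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        byte[][] keys = new byte[regions][];
        keys[0] = p.clone(); // each id starts its own region
        for (int i = 1; i < regions; i++) {
            int v = (int) ((long) i * 65536 / regions);
            byte[] key = Arrays.copyOf(p, p.length + 2);
            key[p.length] = (byte) (v >>> 8); // high byte of the split point
            key[p.length + 1] = (byte) v;     // low byte of the split point
            keys[i] = key;
        }
        return keys;
    }

    public static void main(String[] args) {
        byte[][] keys = splitKeys("id1/", 100);
        System.out.println(keys.length + " split keys for prefix id1/");
        // With the HBase client on the classpath, the combined keys for all
        // ids would be passed to Admin.createTable(descriptor, splitKeys).
    }
}
```

For several ids, the arrays for each prefix would be concatenated in key order before the table is created, giving roughly 100 regions (and so 100 mappers) per id.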