I am trying to delete rows from an HBase table using a MapReduce job.
I am getting the following error:
java.lang.ClassCastException: org.apache.hadoop.hbase.client.Delete cannot be cast to org.apache.hadoop.hbase.KeyValue
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:551)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:99)
at org.apache.hadoop.mapreduce.Reducer.reduce(Reducer.java:144)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.
It looks like this is caused by configureIncrementalLoad setting the job's output value class to KeyValue. HBase ships only PutSortReducer and KeyValueSortReducer; there is no DeleteSortReducer.
My code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class DeleteRows extends Configured implements Tool {

    public static class Map extends
            Mapper<LongWritable, Text, ImmutableBytesWritable, Delete> {

        ImmutableBytesWritable hKey = new ImmutableBytesWritable();
        Delete delRow;

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is a row key. Text.getBytes() returns the backing
            // buffer, which is only valid up to getLength(), so copy that range.
            byte[] rowKey = java.util.Arrays.copyOf(value.getBytes(), value.getLength());
            hKey.set(rowKey);
            delRow = new Delete(rowKey);
            context.write(hKey, delRow);
            // Update counters
            context.getCounter("RowsDeleted", "Success").increment(1);
        }
    }

    @SuppressWarnings("deprecation")
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();
        HBaseConfiguration.addHbaseResources(conf);

        Job job = new Job(conf, "Delete stuff!");
        job.setJarByClass(DeleteRows.class);
        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Delete.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        HTable hTable = new HTable(args[2]);
        // Auto configure partitioner and reducer
        HFileOutputFormat.configureIncrementalLoad(job, hTable);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Report job failure through the exit code
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new DeleteRows(), args);
        System.exit(exitCode);
    }
}
Is there a better or faster way to delete a large number of rows by their row keys? Deleting each row from within a mapper is obviously possible, but I would expect that to be slower than bulk-pushing the deletes to the correct region server.
Your goal is to generate an HFile that contains a stream of deletes (more precisely, delete markers stored as KeyValue entries). The standard way to do that is to use HFileOutputFormat. That format accepts only a stream of KeyValue changes, and there are two standard reducers for it: PutSortReducer and KeyValueSortReducer. By setting the number of reduce tasks to 0 you pass every Delete directly to the output format, which of course cannot work.
Your most obvious options:
1. Write your own DeleteSortReducer. Such reducers are simple, and you can copy most of an existing one: you only need to extract the individual KeyValue entries from each Delete and sort them. PutSortReducer is a good example to follow; it is needed because the changes inside a Put are not sorted.
2. Emit not a stream of Delete objects but a stream of the corresponding KeyValue entries containing delete markers. This is probably the best option for speed.
It turns out that by using TableMapReduceUtil.initTableReducerJob to set up the reducer instead of HFileOutputFormat.configureIncrementalLoad, the code works fine.
TableMapReduceUtil.initTableReducerJob(tableName, null, job);
job.setNumReduceTasks(0);
However, this still does not create deletes for the completebulkload utility. It simply executes the delete RPC.
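For the DeleteSortReducer option suggested above, the core of the reducer is to turn the deletes for one row into delete markers and emit them in sorted order, as PutSortReducer does for puts. The following is only a dependency-free sketch of that idea, not HBase code: DeleteMarkerSketch, DeleteMarker, and the reduce signature are hypothetical names, and the list of column families is assumed to be supplied by the job (a Delete built from just a row key does not list families itself, as far as I understand). In a real reducer the markers would be KeyValue objects of type KeyValue.Type.DeleteFamily written to the context.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class DeleteMarkerSketch {

    /** Hypothetical stand-in for a KeyValue carrying a family-level delete marker. */
    static final class DeleteMarker {
        final byte[] row;
        final byte[] family;
        final long timestamp;

        DeleteMarker(byte[] row, byte[] family, long timestamp) {
            this.row = row;
            this.family = family;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return new String(row, StandardCharsets.UTF_8) + "/"
                    + new String(family, StandardCharsets.UTF_8) + "@" + timestamp;
        }
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Unsigned lexicographic byte comparison, the order used for row keys. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Row ascending, family ascending, then newest timestamp first. */
    static final Comparator<DeleteMarker> ORDER = (x, y) -> {
        int c = compareBytes(x.row, y.row);
        if (c != 0) {
            return c;
        }
        c = compareBytes(x.family, y.family);
        if (c != 0) {
            return c;
        }
        return Long.compare(y.timestamp, x.timestamp);
    };

    /**
     * Models one reduce() call: expand the deletes received for a single row
     * into one marker per column family and return them in sorted order.
     * Markers that compare equal collapse into one entry.
     */
    static List<DeleteMarker> reduce(byte[] row, List<byte[]> families,
                                     List<Long> deleteTimestamps) {
        TreeSet<DeleteMarker> sorted = new TreeSet<>(ORDER);
        for (long ts : deleteTimestamps) {
            for (byte[] family : families) {
                sorted.add(new DeleteMarker(row, family, ts));
            }
        }
        return new ArrayList<>(sorted);
    }

    public static void main(String[] args) {
        // Families are deliberately given out of order.
        List<byte[]> families = Arrays.asList(bytes("meta"), bytes("data"));
        for (DeleteMarker m : reduce(bytes("row-17"), families, Arrays.asList(100L, 200L))) {
            System.out.println(m);
        }
    }
}
```

The TreeSet mirrors what PutSortReducer does with its KeyValue entries: entries are collected per row and written out in comparator order, because an HFile must be written in sorted order. Rows themselves arrive at the reducer already ordered by the shuffle.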