HBase bulk delete using MapReduce job
Using a MapReduce job, I am trying to delete rows from an HBase table. I am getting the following error:
java.lang.ClassCastException: org.apache.hadoop.hbase.client.Delete cannot be cast to org.apache.hadoop.hbase.KeyValue
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:551)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:99)
at org.apache.hadoop.mapreduce.Reducer.reduce(Reducer.java:144)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.
It looks like this is caused by configureIncrementalLoad setting the output value class to KeyValue. There is a PutSortReducer and a KeyValueSortReducer, but no DeleteSortReducer.
My code:
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DeleteRows extends Configured implements Tool {

    public static class Map extends
            Mapper<LongWritable, Text, ImmutableBytesWritable, Delete> {

        ImmutableBytesWritable hKey = new ImmutableBytesWritable();
        Delete delRow;

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Text.getBytes() returns the backing array, which may be longer
            // than the actual content, so copy only the valid length.
            byte[] row = Arrays.copyOf(value.getBytes(), value.getLength());
            hKey.set(row);
            delRow = new Delete(row);
            context.write(hKey, delRow);
            // Update counters
            context.getCounter("RowsDeleted", "Success").increment(1);
        }
    }

    @SuppressWarnings("deprecation")
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();
        HBaseConfiguration.addHbaseResources(conf);

        Job job = new Job(conf, "Delete stuff!");
        job.setJarByClass(DeleteRows.class);
        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Delete.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        HTable hTable = new HTable(conf, args[2]);
        // Auto configure partitioner and reducer
        HFileOutputFormat.configureIncrementalLoad(job, hTable);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new DeleteRows(), args);
        System.exit(exitCode);
    }
}
Is there a better / faster way to delete a large number of rows using their row keys? Obviously deleting each row from within a mapper is possible, but I would imagine that is slower than bulk-pushing the deletes to the correct region server.
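For reference, the "delete each row from within a mapper" approach being compared against could be sketched roughly like this, using the old (0.94-era) HTable API with batched delete RPCs. This is a hypothetical sketch, not code from the original job; the `delete.table.name` property and the batch size of 1000 are illustrative assumptions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that issues batched Delete RPCs directly,
// instead of writing HFiles. No reducer is needed for this approach.
public class DirectDeleteMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private HTable table;
    private final List<Delete> buffer = new ArrayList<Delete>();

    @Override
    protected void setup(Context context) throws IOException {
        // "delete.table.name" is an illustrative job property, not from the post.
        table = new HTable(
                HBaseConfiguration.create(context.getConfiguration()),
                context.getConfiguration().get("delete.table.name"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException {
        // Copy only the valid bytes of the Text backing array.
        byte[] row = Arrays.copyOf(value.getBytes(), value.getLength());
        buffer.add(new Delete(row));
        if (buffer.size() >= 1000) {
            table.delete(buffer); // batched delete RPC to the region servers
            buffer.clear();
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        if (!buffer.isEmpty()) {
            table.delete(buffer); // flush the remainder
        }
        table.close();
    }
}
```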
Your goal is to generate an HFile that contains a stream of Delete operations (which are actually stored as KeyValue delete marks). The standard way to do this is with HFileOutputFormat. However, you can only feed a stream of KeyValue changes into this format, and there are only two standard reducers for it: PutSortReducer and KeyValueSortReducer. By setting the number of reduce tasks to 0, you actually pass all the Delete objects directly to the output format, which of course cannot work.
Your most obvious options:

1. Write your own DeleteSortReducer. Such reducers are pretty simple and you can almost copy one; PutSortReducer is a good example. Put changes arrive at the reducer unsorted, which is why such a reducer is needed in the first place.
2. Instead of a stream of Delete objects, construct and emit a stream of the appropriate KeyValue cells containing the delete marks. This is probably the best option for speed.

It turns out that using TableMapReduceUtil.initTableReducerJob to set up the reducer, instead of HFileOutputFormat.configureIncrementalLoad, makes the code work fine:
TableMapReduceUtil.initTableReducerJob(tableName, null, job);
job.setNumReduceTasks(0);
However, this still does not create delete HFiles for the completebulkload utility. It simply executes the delete RPCs.
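For the first option in the answer above, a DeleteSortReducer could be sketched roughly as follows, modeled on the 0.94-era PutSortReducer. This is a hypothetical sketch, not tested against any particular HBase version; it assumes the old API where Delete#getFamilyMap() returns Map&lt;byte[], List&lt;KeyValue&gt;&gt; (newer versions use Cell). Note that a Delete built from only a row key carries no KeyValues, so the mapper would also need to add explicit deleteFamily(...) marks for each column family for anything to be written.

```java
import java.io.IOException;
import java.util.List;
import java.util.TreeSet;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical DeleteSortReducer, modeled on PutSortReducer: it unpacks
// each Delete into its KeyValue delete marks and emits them in sorted
// order, which is what HFileOutputFormat requires.
public class DeleteSortReducer extends
        Reducer<ImmutableBytesWritable, Delete, ImmutableBytesWritable, KeyValue> {

    @Override
    protected void reduce(ImmutableBytesWritable row, Iterable<Delete> deletes,
            Context context) throws IOException, InterruptedException {
        // Collect and sort all delete marks for this row before emitting.
        TreeSet<KeyValue> sorted = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
        for (Delete d : deletes) {
            for (List<KeyValue> kvs : d.getFamilyMap().values()) {
                sorted.addAll(kvs);
            }
        }
        for (KeyValue kv : sorted) {
            context.write(row, kv);
        }
    }
}
```

With such a reducer, you would keep HFileOutputFormat.configureIncrementalLoad for the partitioner setup and then override the reducer with job.setReducerClass(DeleteSortReducer.class).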