HBase bulk delete using MapReduce job
Using a MapReduce job, I am trying to delete rows from an HBase table. I am getting the following error:
java.lang.ClassCastException: org.apache.hadoop.hbase.client.Delete cannot be cast to org.apache.hadoop.hbase.KeyValue
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:551)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:99)
at org.apache.hadoop.mapreduce.Reducer.reduce(Reducer.java:144)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.
It looks like this is caused by configureIncrementalLoad setting the output value class to KeyValue. There is a PutSortReducer and a KeyValueSortReducer, but no DeleteSortReducer.
My code:
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DeleteRows extends Configured implements Tool {

    public static class Map extends
            Mapper<LongWritable, Text, ImmutableBytesWritable, Delete> {

        ImmutableBytesWritable hKey = new ImmutableBytesWritable();
        Delete delRow;

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Text.getBytes() returns the backing array, which may be longer
            // than the actual content, so copy only the valid length.
            byte[] row = Arrays.copyOf(value.getBytes(), value.getLength());
            hKey.set(row);
            delRow = new Delete(row);
            context.write(hKey, delRow);
            // Update counters
            context.getCounter("RowsDeleted", "Success").increment(1);
        }
    }

    @SuppressWarnings("deprecation")
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();
        HBaseConfiguration.addHbaseResources(conf);

        Job job = new Job(conf, "Delete stuff!");
        job.setJarByClass(DeleteRows.class);
        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Delete.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        HTable hTable = new HTable(conf, args[2]);
        // Auto configure partitioner and reducer
        HFileOutputFormat.configureIncrementalLoad(job, hTable);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new DeleteRows(), args);
        System.exit(exitCode);
    }
}
Is there a better / faster way to delete a large number of rows using their row keys? Obviously deleting each row from within a mapper is possible, but I would imagine that is slower than bulk-pushing the deletes to the correct region server.
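For reference, the "delete each row from within a mapper" approach being compared against could be sketched roughly like this, using the old (0.94-era) HTable API with batched delete RPCs. This is a hypothetical sketch, not code from the original job; the `delete.table.name` property and the batch size of 1000 are illustrative assumptions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that issues batched Delete RPCs directly,
// instead of writing HFiles. No reducer is needed for this approach.
public class DirectDeleteMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private HTable table;
    private final List<Delete> buffer = new ArrayList<Delete>();

    @Override
    protected void setup(Context context) throws IOException {
        // "delete.table.name" is an illustrative job property, not from the post.
        table = new HTable(
                HBaseConfiguration.create(context.getConfiguration()),
                context.getConfiguration().get("delete.table.name"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException {
        // Copy only the valid bytes of the Text backing array.
        byte[] row = Arrays.copyOf(value.getBytes(), value.getLength());
        buffer.add(new Delete(row));
        if (buffer.size() >= 1000) {
            table.delete(buffer); // batched delete RPC to the region servers
            buffer.clear();
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        if (!buffer.isEmpty()) {
            table.delete(buffer); // flush the remainder
        }
        table.close();
    }
}
```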
Your goal is to generate an HFile that contains a stream of Delete operations (which are actually stored as KeyValue delete marks). The standard way to do this is with HFileOutputFormat. However, you can only feed a stream of KeyValue changes into this format, and there are only two standard reducers for it: PutSortReducer and KeyValueSortReducer. By setting the number of reduce tasks to 0, you actually pass all the Delete objects directly to the output format, which of course cannot work.
Your most obvious options:

1. Write your own DeleteSortReducer. Such reducers are pretty simple and you can almost copy one; PutSortReducer is a good example. Put changes arrive at the reducer unsorted, which is why such a reducer is needed in the first place.
2. Instead of a stream of Delete objects, construct and emit a stream of the appropriate KeyValue cells containing the delete marks. This is probably the best option for speed.

It turns out that using TableMapReduceUtil.initTableReducerJob to set up the reducer, instead of HFileOutputFormat.configureIncrementalLoad, makes the code work fine:
TableMapReduceUtil.initTableReducerJob(tableName, null, job);
job.setNumReduceTasks(0);
However, this still does not create delete HFiles for the completebulkload utility. It simply executes the delete RPCs.
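For the first option in the answer above, a DeleteSortReducer could be sketched roughly as follows, modeled on the 0.94-era PutSortReducer. This is a hypothetical sketch, not tested against any particular HBase version; it assumes the old API where Delete#getFamilyMap() returns Map&lt;byte[], List&lt;KeyValue&gt;&gt; (newer versions use Cell). Note that a Delete built from only a row key carries no KeyValues, so the mapper would also need to add explicit deleteFamily(...) marks for each column family for anything to be written.

```java
import java.io.IOException;
import java.util.List;
import java.util.TreeSet;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical DeleteSortReducer, modeled on PutSortReducer: it unpacks
// each Delete into its KeyValue delete marks and emits them in sorted
// order, which is what HFileOutputFormat requires.
public class DeleteSortReducer extends
        Reducer<ImmutableBytesWritable, Delete, ImmutableBytesWritable, KeyValue> {

    @Override
    protected void reduce(ImmutableBytesWritable row, Iterable<Delete> deletes,
            Context context) throws IOException, InterruptedException {
        // Collect and sort all delete marks for this row before emitting.
        TreeSet<KeyValue> sorted = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
        for (Delete d : deletes) {
            for (List<KeyValue> kvs : d.getFamilyMap().values()) {
                sorted.addAll(kvs);
            }
        }
        for (KeyValue kv : sorted) {
            context.write(row, kv);
        }
    }
}
```

With such a reducer, you would keep HFileOutputFormat.configureIncrementalLoad for the partitioner setup and then override the reducer with job.setReducerClass(DeleteSortReducer.class).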