HBase bulk delete using a MapReduce job

Question

I am trying to delete rows from an HBase table using a MapReduce job.

I am getting the following error.

java.lang.ClassCastException: org.apache.hadoop.hbase.client.Delete cannot be cast to org.apache.hadoop.hbase.KeyValue
        at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124)
        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:551)
        at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
        at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:99)
        at org.apache.hadoop.mapreduce.Reducer.reduce(Reducer.java:144)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.

It looks like this is caused by configureIncrementalLoad setting the output to KeyValue. It only has a PutSortReducer and a KeyValueSortReducer, but no DeleteSortReducer.

My code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DeleteRows extends Configured implements Tool {

    public static class Map extends
            Mapper<LongWritable, Text, ImmutableBytesWritable, Delete> {

        ImmutableBytesWritable hKey = new ImmutableBytesWritable();
        Delete delRow;

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
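            // Text.getBytes() exposes the backing array, which can be longer than
            // the valid data, so copy only the first getLength() bytes.
            hKey.set(java.util.Arrays.copyOf(value.getBytes(), value.getLength()));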
            delRow = new Delete(hKey.get());
            context.write(hKey, delRow);
            // Update counters
            context.getCounter("RowsDeleted", "Success").increment(1);
        }
    }


    @SuppressWarnings("deprecation")
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();
        HBaseConfiguration.addHbaseResources(conf);

        Job job = new Job(conf, "Delete stuff!");
        job.setJarByClass(DeleteRows.class);

        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Delete.class);

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        HTable hTable = new HTable(args[2]);
        // Auto configure partitioner and reducer
        HFileOutputFormat.configureIncrementalLoad(job, hTable);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Report job failure through the exit code instead of always returning 0.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new DeleteRows(), args);
        System.exit(exitCode);
    }
}

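Is there a better/faster way to delete a large number of rows by row key? Obviously, deleting each row in the mapper is possible, but I would imagine that is slower than bulk-pushing the deletes to the correct region server.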

Answer 1

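Your goal is to generate HFiles that contain a stream of Delete operations (in practice, delete markers stored as KeyValue entries). The standard way to do this is to use HFileOutputFormat. In fact, you can only feed a stream of KeyValue changes into this format, and there are two standard reducers: PutSortReducer and KeyValueSortReducer. By setting the number of reduce tasks to 0, you pass all of your Delete objects directly to the output format, which of course cannot work.

Your most obvious options:

Add your own reducer, DeleteSortReducer. Such a reducer is quite simple, and you can almost copy an existing one: you only need to extract the individual KeyValue entries from each Delete and sort them. PutSortReducer is a good example to follow. Put changes do not arrive sorted, which is why such a reducer is needed.
Instead of a stream of Delete objects, construct a stream of the appropriate KeyValue entries containing delete markers. This is probably the best option for speed.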
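
Both options come down to the same step: turn each row key into delete-marker entries and hand them to the output format in sorted order. The sketch below models that step in plain Java, with no HBase dependency. The Marker class is a hypothetical stand-in for a KeyValue that carries a delete marker, and toSortedMarkers and compareBytes are illustrative names, not HBase API. The ordering used (row ascending, family ascending, newest timestamp first) is an assumption modelled on how HBase orders KeyValue entries; a real reducer would build actual KeyValue objects and rely on HBase's own comparator.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class DeleteMarkerSort {

    /** Hypothetical stand-in for an HBase KeyValue that carries a delete marker. */
    public static final class Marker {
        public final byte[] row;
        public final byte[] family;
        public final long timestamp;

        public Marker(byte[] row, byte[] family, long timestamp) {
            this.row = row;
            this.family = family;
            this.timestamp = timestamp;
        }
    }

    /** Lexicographic comparison that treats every byte as unsigned. */
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Row ascending, then family ascending, then newest timestamp first. */
    public static final Comparator<Marker> ORDER = new Comparator<Marker>() {
        @Override
        public int compare(Marker x, Marker y) {
            int c = compareBytes(x.row, y.row);
            if (c != 0) {
                return c;
            }
            c = compareBytes(x.family, y.family);
            if (c != 0) {
                return c;
            }
            return Long.compare(y.timestamp, x.timestamp);
        }
    };

    /** Expands each row key into one marker per column family and sorts the result. */
    public static List<Marker> toSortedMarkers(List<String> rowKeys,
                                               List<String> families,
                                               long timestamp) {
        List<Marker> markers = new ArrayList<Marker>();
        for (String rowKey : rowKeys) {
            byte[] row = rowKey.getBytes(StandardCharsets.UTF_8);
            for (String family : families) {
                markers.add(new Marker(row, family.getBytes(StandardCharsets.UTF_8), timestamp));
            }
        }
        Collections.sort(markers, ORDER);
        return markers;
    }

    public static void main(String[] args) {
        // Row keys and family names here are made-up sample values.
        List<Marker> sorted = toSortedMarkers(
                Arrays.asList("row-b", "row-a"),
                Arrays.asList("cf2", "cf1"),
                System.currentTimeMillis());
        for (Marker m : sorted) {
            System.out.println(new String(m.row, StandardCharsets.UTF_8) + " / "
                    + new String(m.family, StandardCharsets.UTF_8));
        }
    }
}
```

The sketch stamps every marker with an explicit timestamp supplied by the caller. That is a design assumption: entries written to files for bulk loading do not pass through the normal write path, so nothing else would assign a time to them.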

Answer 2

By using TableMapReduceUtil.initTableReducerJob to set up the reducer instead of HFileOutputFormat.configureIncrementalLoad, the code works fine.

TableMapReduceUtil.initTableReducerJob(tableName, null, job);
job.setNumReduceTasks(0);

However, this still does not create delete markers for the completebulkload utility. It just executes delete RPCs.
