
How to import a CSV into HBASE table using MapReduce

Hi, I am quite new to Hadoop and I am trying to import a CSV table into HBase using MapReduce.

I am using Hadoop 1.2.1 and HBase 1.1.1.

I have data in the following format:

Wban Number, YearMonthDay, Time, Hourly Precip
03011,20060301,0050,0
03011,20060301,0150,0

I have written the following code for the bulk load:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class BulkLoadDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int result = ToolRunner.run(HBaseConfiguration.create(), new BulkLoadDriver(), args);
        System.exit(result);
    }

    public static enum COUNTER_TEST { FILE_FOUND, FILE_NOT_FOUND };

    public String tableName = "hpd_table"; // name of the table to be inserted in HBase

    @Override
    public int run(String[] args) throws Exception {

        //Configuration conf = this.getConf();
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "BulkLoad");
        job.setJarByClass(getClass());

        job.setMapperClass(bulkMapper.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);

        TableMapReduceUtil.initTableReducerJob(tableName, null, job); // for HBase table
        job.setNumReduceTasks(0);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    private static class bulkMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        //static class bulkMapper extends TableMapper<ImmutableBytesWritable, Put> {

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] val = value.toString().split(",");

            // store the split values in bytes format so that they can be added to the Put object
            byte[] wban = Bytes.toBytes(val[0]);
            byte[] ymd = Bytes.toBytes(val[1]);
            byte[] tym = Bytes.toBytes(val[2]);
            byte[] hPrec = Bytes.toBytes(val[3]);

            Put put = new Put(wban);
            put.add(ymd, tym, hPrec); // treats ymd as the column family and tym as the qualifier

            System.out.println(wban); // prints the byte[] reference
            context.write(new ImmutableBytesWritable(wban), put);

            context.getCounter(COUNTER_TEST.FILE_FOUND).increment(1);
        }
    }
}

I have created a jar for this and ran the following in the terminal:

hadoop jar ~/hadoop-1.2.1/MRData/bulkLoad.jar bulkLoad.BulkLoadDriver /MR/input/200603hpd.txt hpd_table

But the output that I get is hundreds of lines of the following type:

attempt_201509012322_0001_m_000000_0: [B@2d22bfc8
attempt_201509012322_0001_m_000000_0: [B@445cfa9e

I am not sure what they mean or how to perform this bulk upload. Please help.

Thanks in advance.

There are several ways to import data into HBase. Please have a look at the following link:

http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/admin_hbase_import.html

HBase BulkLoad:

  1. Data file in CSV format

  2. Process your data into HFile format. See http://hbase.apache.org/book/hfile_format.html for details about the HFile format. Usually you use a MapReduce job for the conversion, and you often need to write the Mapper yourself because your data is unique. The job must emit the row key as the Key, and either a KeyValue, a Put, or a Delete as the Value. The Reducer is handled by HBase; configure it using HFileOutputFormat.configureIncrementalLoad(), which does the following (a minimal driver sketch is shown after this list):

    • Inspects the table to configure a total order partitioner
    • Uploads the partitions file to the cluster and adds it to the DistributedCache
    • Sets the number of reduce tasks to match the current number of regions
    • Sets the output key/value class to match HFileOutputFormat requirements
    • Sets the Reducer to perform the appropriate sorting (either KeyValueSortReducer or PutSortReducer)
  3. One HFile is created per region in the output folder. Input data is almost completely re-written, so you need available disk space at least twice the size of the original data set. For example, for a 100 GB output from mysqldump, you should have at least 200 GB of available disk space in HDFS. You can delete the original input file at the end of the process.

  4. Load the files into HBase. Use the LoadIncrementalHFiles command (more commonly known as the completebulkload tool), passing it a URL that locates the files in HDFS. Each file is loaded into the relevant region on the RegionServer for that region. You can limit the number of versions that are loaded by passing the --versions=N option, where N is the maximum number of versions to include, from newest to oldest (largest timestamp to smallest timestamp). If a region was split after the files were created, the tool automatically splits the HFile according to the new boundaries. This process is inefficient, so if your table is being written to by other processes, you should load as soon as the transform step is done. A sketch of this load step is also shown below.
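Putting step 2 together for the CSV shown in the question, here is a minimal sketch of an HFile-generating job, assuming the HBase 1.x client API and a pre-created hpd_table. The class name, the prcp column family, the hourly_precip qualifier and the composite row key are illustrative assumptions, not details taken from the question or your table schema.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical class and column names; the table must already exist with this family.
public class HFileGenerator {

    private static final byte[] FAMILY = Bytes.toBytes("prcp");             // assumed family name
    private static final byte[] QUALIFIER = Bytes.toBytes("hourly_precip"); // assumed qualifier

    public static class CsvToPutMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length < 4 || fields[0].startsWith("Wban")) {
                return; // skip the header line and malformed rows
            }
            // Row key: wban_date_time, so every hourly reading gets its own row.
            byte[] rowKey = Bytes.toBytes(fields[0] + "_" + fields[1] + "_" + fields[2]);
            Put put = new Put(rowKey);
            put.addColumn(FAMILY, QUALIFIER, Bytes.toBytes(fields[3].trim()));
            context.write(new ImmutableBytesWritable(rowKey), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "CSV to HFiles");
        job.setJarByClass(HFileGenerator.class);

        job.setMapperClass(CsvToPutMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));   // CSV input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HFile output directory

        TableName tableName = TableName.valueOf("hpd_table");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(tableName);
             RegionLocator locator = conn.getRegionLocator(tableName)) {
            // Sets the total order partitioner, the Put/KeyValue sort reducer,
            // one reducer per region, and HFileOutputFormat2 as the output format.
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
        }
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Composing the row key from station, date and time gives each hourly reading its own row instead of repeatedly writing to a single row per station; adjust the key design to the queries you actually need.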
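And a minimal sketch of step 4, loading the generated HFiles into the table, assuming the doBulkLoad(Path, Admin, Table, RegionLocator) overload of the HBase 1.x client. The same step can be run from the shell with hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hfile-dir> hpd_table.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

// Hypothetical driver name; args[0] is the HFile directory produced by the job above.
public class CompleteBulkLoad {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Path hfileDir = new Path(args[0]);
        TableName tableName = TableName.valueOf("hpd_table");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(tableName);
             RegionLocator locator = conn.getRegionLocator(tableName);
             Admin admin = conn.getAdmin()) {
            // Moves each HFile into the region that owns its key range,
            // splitting any file whose range crosses a region boundary.
            new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, admin, table, locator);
        }
    }
}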
