
Moving HDFS data into MongoDB

I am trying to move HDFS data into MongoDB. I know how to export data into MySQL by using Sqoop. I don't think I can use Sqoop for MongoDB. I need help understanding how to do that.

This recipe will use the MongoOutputFormat class to load data from an HDFS instance into a MongoDB collection.

Getting ready

The easiest way to get started with the Mongo Hadoop Adaptor is to clone the Mongo-Hadoop project from GitHub and build the project configured for a specific version of Hadoop. A Git client must be installed to clone this project. This recipe assumes that you are using the CDH3 distribution of Hadoop. The official Git client can be found at http://git-scm.com/downloads.

The Mongo Hadoop Adaptor can be found on GitHub at https://github.com/mongodb/mongo-hadoop. This project needs to be built for a specific version of Hadoop. The resulting JAR file must be installed on each node in the $HADOOP_HOME/lib folder. The Mongo Java Driver is also required to be installed on each node in the $HADOOP_HOME/lib folder. It can be found at https://github.com/mongodb/mongo-java-driver/downloads.

How to do it...

 Complete the following steps to copy data from HDFS into MongoDB:
    1.   Clone the mongo-hadoop repository with the following command line:
    git clone https://github.com/mongodb/mongo-hadoop.git


    2.   Switch to the stable release 1.0 branch:
    git checkout release-1.0


    3.   Set the Hadoop version which mongo-hadoop should target. In the folder
    that mongo-hadoop was cloned to, open the build.sbt file with a text editor.
    Change the following line:
    hadoopRelease in ThisBuild := "default"
    to
    hadoopRelease in ThisBuild := "cdh3"


    4.   Build mongo-hadoop:
    ./sbt package
    This will create a file named mongo-hadoop-core_cdh3u3-1.0.0.jar in the
    core/target folder.


    5.   Download the MongoDB Java Driver Version 2.8.0 from
    https://github.com/mongodb/mongo-java-driver/downloads.


    6.   Copy mongo-hadoop and the MongoDB Java Driver to $HADOOP_HOME/lib on
    each node:


    cp mongo-hadoop-core_cdh3u3-1.0.0.jar mongo-2.8.0.jar $HADOOP_HOME/lib


    7.   Create a Java MapReduce program that will read the weblog_entries.txt file
    from HDFS and write the entries to MongoDB using the MongoOutputFormat class:


import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;
import org.bson.types.ObjectId;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class ExportToMongoDBFromHDFS {

    private static final Log log = LogFactory.getLog(ExportToMongoDBFromHDFS.class);

    public static class ReadWeblogs extends Mapper<LongWritable, Text, ObjectId, BSONObject> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            System.out.println("Key: " + key);
            System.out.println("Value: " + value);

            // Each line of weblog_entries.txt is a tab-delimited record.
            String[] fields = value.toString().split("\t");
            String md5 = fields[0];
            String url = fields[1];
            String date = fields[2];
            String time = fields[3];
            String ip = fields[4];

            // Build a BSON document from the parsed fields.
            BSONObject b = new BasicBSONObject();
            b.put("md5", md5);
            b.put("url", url);
            b.put("date", date);
            b.put("time", time);
            b.put("ip", ip);

            // Emit a freshly generated ObjectId as the output key.
            context.write(new ObjectId(), b);
        }
    }

    public static void main(String[] args) throws Exception {

        final Configuration conf = new Configuration();
        // Point the job's output at the test.weblogs collection.
        MongoConfigUtil.setOutputURI(conf, "mongodb://<HOST>:<PORT>/test.weblogs");
        System.out.println("Configuration: " + conf);

        final Job job = new Job(conf, "Export to Mongo");
        Path in = new Path("/data/weblogs/weblog_entries.txt");
        FileInputFormat.setInputPaths(job, in);

        job.setJarByClass(ExportToMongoDBFromHDFS.class);
        job.setMapperClass(ReadWeblogs.class);
        job.setOutputKeyClass(ObjectId.class);
        job.setOutputValueClass(BSONObject.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);

        // Map-only job: no reduce phase is needed.
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

    8.   Export as a runnable JAR file and run the job:
    hadoop jar ExportToMongoDBFromHDFS.jar


    9.   Verify that the weblogs MongoDB collection was populated from the Mongo shell:
    db.weblogs.find();

The basic problem is that MongoDB stores its data in BSON format (binary JSON), while your HDFS data may be in different formats (text, SequenceFile, Avro). The easiest thing to do would be to use Pig to load your results using this driver:

https://github.com/mongodb/mongo-hadoop/tree/master/pig

into MongoDB. You'll have to map your values to your collection; there's a good example on the GitHub page.
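
As a rough sketch of that approach: assuming the tab-delimited weblog_entries.txt layout from the recipe above, that the mongo-hadoop core and pig JARs plus the MongoDB Java Driver have been built and are available to Pig (the JAR file names below are illustrative, as is the <HOST>:<PORT> placeholder), a script using the connector's MongoStorage store function could look like this:

-- Register the connector and driver JARs (file names here are illustrative).
REGISTER mongo-hadoop-core_cdh3u3-1.0.0.jar;
REGISTER mongo-hadoop-pig_cdh3u3-1.0.0.jar;
REGISTER mongo-2.8.0.jar;

-- Load the tab-delimited weblog entries and give the fields a schema.
weblogs = LOAD '/data/weblogs/weblog_entries.txt'
    AS (md5:chararray, url:chararray, date:chararray, time:chararray, ip:chararray);

-- Write each tuple as a document into the test.weblogs collection.
STORE weblogs INTO 'mongodb://<HOST>:<PORT>/test.weblogs'
    USING com.mongodb.hadoop.pig.MongoStorage();

The field names declared in the AS clause are the keys the documents should end up with in MongoDB; that mapping of values to the collection is what the example on the GitHub page walks through in more detail.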
