
Hadoop HDFS MapReduce output into MongoDB

I want to write a Java program which reads input from HDFS, processes it with MapReduce, and writes the output into MongoDB.

Here is the scenario:

  1. I have a Hadoop cluster with 3 datanodes.
  2. A Java program reads input from HDFS and processes it with MapReduce.
  3. Finally, it writes the result into MongoDB.

Actually, reading from HDFS and processing the data with MapReduce are the simple parts. Where I get stuck is writing the result into MongoDB. Is there a Java API for writing the result into MongoDB? Another question: since this is a Hadoop cluster, we don't know which datanode will run the reducer task and produce the result, so is it even possible to write the result into a MongoDB instance installed on a specific server?

If I wanted to write the result into HDFS, the code would look like this:

@Override
public void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException
{
    // Sum all values for this key and emit the total to HDFS.
    long sum = 0;
    for (LongWritable value : values)
    {
        sum += value.get();
    }

    // key is already a Text, so it can be written directly.
    context.write(key, new LongWritable(sum));
}

Now I want to write the result into MongoDB instead of HDFS. How can I do that?

You want the «MongoDB Connector for Hadoop». See the examples.
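If you are on the newer org.apache.hadoop.mapreduce API, as the reducer in the question is, the wiring can stay very close to that code. Here is a minimal sketch under some assumptions: the connector classes com.mongodb.hadoop.MongoOutputFormat and com.mongodb.hadoop.io.BSONWritable are on the classpath, and the class name SumToMongoReducer, the job name, the "sum" field, and the mydb.mycollection URI are made up for illustration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BasicBSONObject;

import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;

// Sketch: sum the values per key and emit a BSON document that the
// connector's output format inserts into the configured collection.
public class SumToMongoReducer extends Reducer<Text, LongWritable, Text, BSONWritable> {

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        BasicBSONObject doc = new BasicBSONObject();
        doc.put("sum", sum); // illustrative field name
        // Each (key, document) pair becomes one insert into the target collection.
        context.write(key, new BSONWritable(doc));
    }

    // Illustrative job wiring (hypothetical URI and job name).
    public static Job buildJob(Configuration conf) throws IOException {
        conf.set("mongo.output.uri", "mongodb://localhost:27017/mydb.mycollection");
        Job job = Job.getInstance(conf, "hdfs-to-mongo");
        job.setReducerClass(SumToMongoReducer.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BSONWritable.class);
        return job;
    }
}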

It's tempting to just add code in your Reducer that, as a side effect, inserts data into your database. Avoid this temptation. One reason to use a connector, as opposed to just inserting data as a side effect of your reducer class, is speculative execution: Hadoop can sometimes run two copies of the exact same reduce task in parallel, which can lead to extraneous inserts and duplicate data.
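If you do end up inserting from inside the reducer anyway (for example with the plain MongoDB Java driver), at the very least reduce-side speculative execution should be disabled so that two copies of the same task cannot both insert. A minimal sketch, assuming the standard Hadoop configuration properties:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NoSpeculationJob {
    public static Job build() throws Exception {
        Configuration conf = new Configuration();
        // Keep Hadoop from launching a duplicate copy of a slow reduce task,
        // which would otherwise insert the same documents twice.
        conf.setBoolean("mapreduce.reduce.speculative", false);               // current property name
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);  // legacy (mapred API) name
        return Job.getInstance(conf, "no-speculation-job");
    }
}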

Yes. You write to Mongo as usual. The fact that your MongoDB is set up to run on shards is a detail that is hidden from you.
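In other words, "writing as usual" just means pointing mongo.output.uri at whatever front end your deployment exposes; for a sharded cluster that is typically a mongos router. A small illustration with made-up host names:

import org.apache.hadoop.conf.Configuration;

public class MongoUriExamples {
    public static Configuration configure(Configuration conf) {
        // Standalone or replica-set deployment (hypothetical host name):
        conf.set("mongo.output.uri", "mongodb://dbhost1:27017/mydb.results");
        // Sharded cluster: point at one or more mongos routers instead; the
        // routers hide the shard layout, so the job code does not change.
        // conf.set("mongo.output.uri", "mongodb://mongos1:27017,mongos2:27017/mydb.results");
        return conf;
    }
}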

I spent my morning implementing the same scenario. Here is my solution:

Create three classes:

  • Experiment.java: job configuration and submission
  • MyMap.java: the mapper class
  • MyReduce.java: the reducer class

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import com.mongodb.hadoop.io.BSONWritable;
import com.mongodb.hadoop.mapred.MongoOutputFormat;

public class Experiment extends Configured implements Tool {

    public int run(final String[] args) throws Exception {
        final Configuration conf = getConf();
        // args[1] is the MongoDB URI the results are written to.
        conf.set("mongo.output.uri", args[1]);

        final JobConf job = new JobConf(conf);
        // args[0] is the HDFS input path.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        job.setJarByClass(Experiment.class);
        job.setInputFormat(org.apache.hadoop.mapred.TextInputFormat.class);
        job.setMapperClass(MyMap.class);
        job.setReducerClass(MyReduce.class);
        job.setOutputFormat(MongoOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BSONWritable.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        JobClient.runJob(job);
        return 0;
    }

    public static void main(final String[] args) throws Exception {
        int res = ToolRunner.run(new Experiment(), args);
        System.exit(res);
    }
}
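The post lists three classes but only shows Experiment.java. For completeness, here is a minimal sketch of what MyMap.java and MyReduce.java could look like for a word-count-style job that matches the key/value classes configured above; the tokenizing logic and the "count" field name are assumptions, not part of the original answer.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// MyMap.java: tokenize each input line and emit (word, 1).
public class MyMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, ONE);
        }
    }
}

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.bson.BasicBSONObject;

import com.mongodb.hadoop.io.BSONWritable;

// MyReduce.java: sum the counts per key and emit one BSON document per key,
// which MongoOutputFormat inserts into the collection named by mongo.output.uri.
public class MyReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, BSONWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, BSONWritable> output,
            Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        BasicBSONObject doc = new BasicBSONObject();
        doc.put("count", sum); // "count" is an illustrative field name
        output.collect(key, new BSONWritable(doc));
    }
}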

When you run the Experiment class from your cluster, you pass two parameters. The first parameter is your input source (an HDFS location); the second is the MongoDB URI that will hold your results. Here is an example call, assuming your Experiment.java is under the package org.example:

sudo -u hdfs hadoop jar ~/jar/myexample.jar org.example.Experiment myfilesinhdfs/* mongodb://192.168.0.1:27017/mydbName.myCollectionName

This might not be the best way, but it does the job for me.
