Output of one MapReduce program as input to another MapReduce program

I am trying a simple example in which the output of one MapReduce job should be the input of another MapReduce job.

The flow should be like this: Mapper1 --> Reducer1 --> Mapper2 --> Reducer2. (The output of Mapper1 must be the input of Reducer1. The output of Reducer1 must be the input of Mapper2. The output of Mapper2 must be the input of Reducer2. The output of Reducer2 must be stored in the output file.)

How can I add multiple Mappers and Reducers to my program so that the flow above is maintained?

Do I need to use ChainMapper or ChainReducer? If so, how can I use them?

You need to implement two separate MapReduce jobs for that. The result of the first job needs to be written to some persistent storage (like HDFS), where it will be read by the second job. SequenceFileOutputFormat and SequenceFileInputFormat are often used for that. Both MapReduce jobs can be executed from the same driver program.
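The hand-off described above can be sketched in plain Java, with local temp files standing in for HDFS and ordinary methods standing in for the two jobs. This is only an illustration of the data flow, not Hadoop code: no Hadoop API is used, and the names `TwoStageDemo`, `jobOne`, `jobTwo`, and `runPipeline` are hypothetical. The first stage counts words and persists its result; the second stage reads that persisted result and groups words by count.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for two chained jobs; no Hadoop API is used here.
class TwoStageDemo {

    // "Job 1": count words in the input file and persist "word<TAB>count" lines.
    static void jobOne(Path input, Path intermediate) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : Files.readAllLines(input)) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // map and reduce collapsed into one step
                }
            }
        }
        List<String> out = new ArrayList<>();
        counts.forEach((word, n) -> out.add(word + "\t" + n));
        Files.write(intermediate, out);
    }

    // "Job 2": read job 1's persisted output and group words by their count.
    static void jobTwo(Path intermediate, Path output) throws IOException {
        Map<Integer, List<String>> byCount = new TreeMap<>();
        for (String line : Files.readAllLines(intermediate)) {
            String[] parts = line.split("\t");
            byCount.computeIfAbsent(Integer.parseInt(parts[1]), k -> new ArrayList<>())
                   .add(parts[0]);
        }
        List<String> out = new ArrayList<>();
        byCount.forEach((n, words) -> out.add(n + "\t" + String.join(",", words)));
        Files.write(output, out);
    }

    // Single "driver": runs job 1, then passes its output path to job 2 as input.
    static List<String> runPipeline(List<String> inputLines) {
        try {
            Path dir = Files.createTempDirectory("two-stage-demo");
            Path input = dir.resolve("input.txt");
            Path intermediate = dir.resolve("job1-output.txt");
            Path output = dir.resolve("job2-output.txt");
            Files.write(input, inputLines);
            jobOne(input, intermediate);
            jobTwo(intermediate, output); // job 1's output is job 2's input
            return Files.readAllLines(output);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        runPipeline(List.of("a b", "a b c")).forEach(System.out::println);
    }
}
```

Running `main` prints two tab-separated lines, `1	c` and `2	a,b`. In a real Hadoop driver the same shape would apply: the path passed as the first job's output is the path passed as the second job's input, and the second job is only started after the first one has completed.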

I guess what you are looking for is ControlledJob and JobControl. They fit your purpose well. In a single driver class you can build multiple jobs that depend on each other. The following code might help you understand.

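    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Inside the driver; the Configuration and Path variables are assumed to be defined earlier
    Job jobOne = Job.getInstance(jobOneConf, "Job-1");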
    FileInputFormat.addInputPath(jobOne, jobOneInput);
    FileOutputFormat.setOutputPath(jobOne, jobOneOutput);
    ControlledJob jobOneControl = new ControlledJob(jobOneConf);
    jobOneControl.setJob(jobOne);

    Job jobTwo = Job.getInstance(jobTwoConf, "Job-2");
    FileInputFormat.addInputPath(jobTwo, jobOneOutput); // here we set the job-1's output as job-2's input
    FileOutputFormat.setOutputPath(jobTwo, jobTwoOutput); // final output
    ControlledJob jobTwoControl = new ControlledJob(jobTwoConf);
    jobTwoControl.setJob(jobTwo);

    JobControl jobControl = new JobControl("job-control");
    jobControl.addJob(jobOneControl);
    jobControl.addJob(jobTwoControl);
    jobTwoControl.addDependingJob(jobOneControl); // this condition makes the job-2 wait until job-1 is done

    Thread jobControlThread = new Thread(jobControl);
    jobControlThread.start();

    // JobControl.run() keeps looping until stop() is called, so poll
    // allFinished() rather than joining the thread
    while (!jobControl.allFinished()) {
        Thread.sleep(1000);
    }
    jobControl.stop();
