简体   繁体   中英

How to create a chain of Hadoop job without using OOzie

I want to create a chain of three Hadoop jobs, where the output of one job is fed as the input to the second job and so on. I would like to do this without using Oozie.

I have written the following code to acheive it :-

public class TfIdf {
    public static void main(String args[]) throws IOException, InterruptedException, ClassNotFoundException
    {
        TfIdf tfIdf = new TfIdf();
        tfIdf.runWordCount();
        tfIdf.runDocWordCount();
        tfIdf.TFIDFComputation();
    }

    public void runWordCount() throws IOException, InterruptedException, ClassNotFoundException
    {
        Job job = new Job();


        job.setJarByClass(TfIdf.class);
        job.setJobName("Word Count calculation");

        job.setMapperClass(WordFrequencyMapper.class);
        job.setReducerClass(WordFrequencyReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("ouput"));

        job.waitForCompletion(true);
    }

    public void runDocWordCount() throws IOException, InterruptedException, ClassNotFoundException
    {
        Job job = new Job();

        job.setJarByClass(TfIdf.class);
        job.setJobName("Word Doc count calculation");

        job.setMapperClass(WordCountDocMapper.class);
        job.setReducerClass(WordCountDocReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path("output"));
        FileOutputFormat.setOutputPath(job, new Path("ouput_job2"));

        job.waitForCompletion(true);
    }

    public void TFIDFComputation() throws IOException, InterruptedException, ClassNotFoundException
    {
        Job job = new Job();

        job.setJarByClass(TfIdf.class);
        job.setJobName("TFIDF calculation");

        job.setMapperClass(TFIDFMapper.class);
        job.setReducerClass(TFIDFReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path("output_job2"));
        FileOutputFormat.setOutputPath(job, new Path("ouput_job3"));

        job.waitForCompletion(true);
    }
}

However I get the error:

Input path does not exist: hdfs://localhost.localdomain:8020/user/cloudera/output

Could anyone help me out with this?

This answer is coming a little late, but... It's just a simple typo in your dir names. You've written your 1st job's output to dir "ouput", and your 2nd job is looking for it in "output".

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM