
Using a cron job to run Hadoop programs in Linux

I am new to Linux. In my project we are using Hadoop, and we have written 3 MapReduce programs: the output of the first program is the input to the second, and the output of the second is the input to the third. Right now, running them means three separate runs: first we run the first program's configuration, then the second, then the third. We want to run all 3 programs as one complete job. Can we use a cron job in Linux for this? If yes, please mention the steps. We need a cron job because we need to run all three programs repeatedly, every three hours.

1. Create a shell script that uses && to execute your Hadoop programs sequentially: run your first command, then &&, then your second command, and so on.

Ex: first command && second command && third command
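
A minimal wrapper script might look like this sketch; the jar path, driver class names, and HDFS paths are placeholders you must replace with your own:

#!/bin/bash
# run_jobs.sh -- runs the three MapReduce jobs in sequence.
# Jar path, class names and HDFS paths below are placeholders.
hadoop jar /path/to/your.jar FirstJobDriver /hdfs/input /hdfs/out1 && \
hadoop jar /path/to/your.jar SecondJobDriver /hdfs/out1 /hdfs/out2 && \
hadoop jar /path/to/your.jar ThirdJobDriver /hdfs/out2 /hdfs/out3

Because of &&, each job starts only if the previous one exited successfully. Make the script executable with chmod +x before scheduling it.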

2. Type this in a terminal:

crontab -e

This will open the crontab editor in the terminal.

Add this line to run your shell script every 15 minutes:

*/15 * * * * /path/to/your/shell/script
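
The question asks for a run every three hours rather than every 15 minutes; assuming that requirement, the entry would instead be:

0 */3 * * * /path/to/your/shell/script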

For more help with crontab, see https://help.ubuntu.com/community/CronHowto

DELETE/COPY OUTPUT DIRECTORY:

If you want to avoid the "output directory already exists" error, delete the output directory (or copy it elsewhere first, if you need to keep the results) before executing your sequential Hadoop jobs. Add this to your shell script before the Hadoop job commands:

# Delete the output directory in HDFS
hadoop fs -rmr /your/hdfs/output/directory/to/be/deleted
# Copy the output directory from HDFS to HDFS
hadoop fs -mkdir /new/hdfs/location
hadoop fs -cp /your/hdfs/output/directory/to/be/copied/* /new/hdfs/location
# Copy from HDFS to local filesystem
sudo mkdir /path/to/local/filesystem
hadoop fs -copyToLocal /your/hdfs/output/directory/to/be/copied/* /path/to/local/filesystem

NOTE: If you are using a recent Hadoop version, replace hadoop fs with hdfs dfs, and -rmr with -rm -r. Don't forget the trailing "/*" when copying a directory, since it expands to all entries in that directory (MapReduce output files such as part-r-00000 contain no dot, so a "*.*" glob would miss them). Change the HDFS file paths as per your configuration.
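
For reference, on a recent Hadoop release (2.x or later) the cleanup step above would look like this, with the same placeholder path as before:

# Delete the output directory in HDFS (Hadoop 2.x+ syntax)
hdfs dfs -rm -r /your/hdfs/output/directory/to/be/deleted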

The best way to handle this case is the chained-MapReduce approach.

http://gandhigeet.blogspot.in/2012/12/as-discussed-in-previous-post-hadoop.html

I am posting the driver code that calls the three MapReduce jobs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ExerciseDriver {

    static Configuration conf;

    public static void main(String[] args) throws Exception {

        ExerciseDriver ED = new ExerciseDriver();
        conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        if (args.length < 2) {
            System.out.println("Too few arguments. Arguments should be: <hdfs input folder> <hdfs output folder>");
            System.exit(1);
        }

        String pathin1 = args[0];
        String pathout1 = args[1];

        // Delete stale output directories first so the jobs do not fail
        // with "output directory already exists"
        fs.delete(new Path(pathout1 + "_1"), true);
        fs.delete(new Path(pathout1 + "_2"), true);
        fs.delete(new Path(pathout1 + "_3"), true);

        // Run the three jobs in sequence; each job's output directory is
        // the next job's input directory. Stop the chain if a job fails.
        if (ED.runFirstJob(pathin1, pathout1 + "_1") != 0) System.exit(1);
        if (ED.runSecondJob(pathout1 + "_1", pathout1 + "_2") != 0) System.exit(1);
        if (ED.runThirdJob(pathout1 + "_2", pathout1 + "_3") != 0) System.exit(1);
    }

    public int runFirstJob(String pathin, String pathout) throws Exception {
        Job job = new Job(conf);
        job.setJarByClass(ExerciseDriver.class);
        job.setMapperClass(ExerciseMapper1.class);
        job.setCombinerClass(ExerciseCombiner.class);
        job.setReducerClass(ExerciseReducer1.class);
        job.setInputFormatClass(ParagrapghInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(pathin));
        FileOutputFormat.setOutputPath(job, new Path(pathout));
        // waitForCompletion submits the job and blocks until it finishes
        boolean success = job.waitForCompletion(true);
        return success ? 0 : -1;
    }

    public int runSecondJob(String pathin, String pathout) throws Exception {
        Job job = new Job(conf);
        job.setJarByClass(ExerciseDriver.class);
        job.setMapperClass(ExerciseMapper2.class);
        job.setReducerClass(ExerciseReducer2.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(pathin));
        FileOutputFormat.setOutputPath(job, new Path(pathout));
        boolean success = job.waitForCompletion(true);
        return success ? 0 : -1;
    }

    public int runThirdJob(String pathin, String pathout) throws Exception {
        Job job = new Job(conf);
        job.setJarByClass(ExerciseDriver.class);
        job.setMapperClass(ExerciseMapper3.class);
        job.setReducerClass(ExerciseReducer3.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(pathin));
        FileOutputFormat.setOutputPath(job, new Path(pathout));
        boolean success = job.waitForCompletion(true);
        return success ? 0 : -1;
    }
}

After that, just schedule the jar file in crontab, or else you can also go with Oozie. As mentioned, in the driver class the 3 MapReduce jobs are executed one after the other; the first job's output is the second job's input.
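
As a sketch, a crontab entry for the jar could look like the line below; the jar path, driver class, and HDFS paths are placeholders, and you may need the full path to the hadoop binary because cron runs with a minimal PATH:

0 */3 * * * /usr/bin/hadoop jar /path/to/ExerciseDriver.jar ExerciseDriver /hdfs/input /hdfs/output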
