
Using a cron job to run Hadoop programs in Linux

I am new to Linux. In my project we are using Hadoop, and we have written 3 MapReduce programs: the output of the first program is the input to the second, and the output of the second is the input to the third. Right now, running them means three separate runs: first we run the first program's configuration, then the second, then the third. We want to run all 3 programs as one complete job. Can we use a cron job in Linux for this? If yes, please mention the steps. We need a cron job because we need to run all three programs repeatedly, every three hours.

1. Create a shell script that uses && to execute your Hadoop programs sequentially: run your first command, then &&, then your second command, and so on.

Ex: first command && second command && third command
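
A minimal wrapper script might look like this sketch; the jar path, driver class names, and HDFS paths are placeholders you must replace with your own:

#!/bin/bash
# run_jobs.sh -- runs the three MapReduce jobs in sequence.
# Jar path, class names and HDFS paths below are placeholders.
hadoop jar /path/to/your.jar FirstJobDriver /hdfs/input /hdfs/out1 && \
hadoop jar /path/to/your.jar SecondJobDriver /hdfs/out1 /hdfs/out2 && \
hadoop jar /path/to/your.jar ThirdJobDriver /hdfs/out2 /hdfs/out3

Because of &&, each job starts only if the previous one exited successfully. Make the script executable with chmod +x before scheduling it.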

2. Type this in a terminal:

crontab -e

This will open the crontab editor in the terminal.

Add this line to run your shell script every 15 minutes:

*/15 * * * * /path/to/your/shell/script
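
The question asks for a run every three hours rather than every 15 minutes; assuming that requirement, the entry would instead be:

0 */3 * * * /path/to/your/shell/script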

For more help with crontab, see https://help.ubuntu.com/community/CronHowto

DELETE/COPY OUTPUT DIRECTORY:

If you want to avoid the "output directory already exists" error, delete the output directory (or copy it elsewhere first, if you need to keep the results) before executing your sequential Hadoop jobs. Add this to your shell script before the Hadoop job commands:

# Delete the output directory in HDFS
hadoop fs -rmr /your/hdfs/output/directory/to/be/deleted
# Copy the output directory from HDFS to HDFS
hadoop fs -mkdir /new/hdfs/location
hadoop fs -cp /your/hdfs/output/directory/to/be/copied/* /new/hdfs/location
# Copy from HDFS to local filesystem
sudo mkdir /path/to/local/filesystem
hadoop fs -copyToLocal /your/hdfs/output/directory/to/be/copied/* /path/to/local/filesystem

NOTE: If you are using a recent Hadoop version, replace hadoop fs with hdfs dfs, and -rmr with -rm -r. Don't forget the trailing "/*" when copying a directory, since it expands to all entries in that directory (MapReduce output files such as part-r-00000 contain no dot, so a "*.*" glob would miss them). Change the HDFS file paths as per your configuration.
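
For reference, on a recent Hadoop release (2.x or later) the cleanup step above would look like this, with the same placeholder path as before:

# Delete the output directory in HDFS (Hadoop 2.x+ syntax)
hdfs dfs -rm -r /your/hdfs/output/directory/to/be/deleted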

The best way to handle this case is the chained-MapReduce approach.

http://gandhigeet.blogspot.in/2012/12/as-discussed-in-previous-post-hadoop.html

I am posting the driver code that calls the three MapReduce jobs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ExerciseDriver {

    static Configuration conf;

    public static void main(String[] args) throws Exception {

        ExerciseDriver ED = new ExerciseDriver();
        conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        if (args.length < 2) {
            System.out.println("Too few arguments. Arguments should be: <hdfs input folder> <hdfs output folder>");
            System.exit(1);
        }

        String pathin1 = args[0];
        String pathout1 = args[1];

        // Delete stale output directories first so the jobs do not fail
        // with "output directory already exists"
        fs.delete(new Path(pathout1 + "_1"), true);
        fs.delete(new Path(pathout1 + "_2"), true);
        fs.delete(new Path(pathout1 + "_3"), true);

        // Run the three jobs in sequence; each job's output directory is
        // the next job's input directory. Stop the chain if a job fails.
        if (ED.runFirstJob(pathin1, pathout1 + "_1") != 0) System.exit(1);
        if (ED.runSecondJob(pathout1 + "_1", pathout1 + "_2") != 0) System.exit(1);
        if (ED.runThirdJob(pathout1 + "_2", pathout1 + "_3") != 0) System.exit(1);
    }

    public int runFirstJob(String pathin, String pathout) throws Exception {
        Job job = new Job(conf);
        job.setJarByClass(ExerciseDriver.class);
        job.setMapperClass(ExerciseMapper1.class);
        job.setCombinerClass(ExerciseCombiner.class);
        job.setReducerClass(ExerciseReducer1.class);
        job.setInputFormatClass(ParagrapghInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(pathin));
        FileOutputFormat.setOutputPath(job, new Path(pathout));
        // waitForCompletion submits the job and blocks until it finishes
        boolean success = job.waitForCompletion(true);
        return success ? 0 : -1;
    }

    public int runSecondJob(String pathin, String pathout) throws Exception {
        Job job = new Job(conf);
        job.setJarByClass(ExerciseDriver.class);
        job.setMapperClass(ExerciseMapper2.class);
        job.setReducerClass(ExerciseReducer2.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(pathin));
        FileOutputFormat.setOutputPath(job, new Path(pathout));
        boolean success = job.waitForCompletion(true);
        return success ? 0 : -1;
    }

    public int runThirdJob(String pathin, String pathout) throws Exception {
        Job job = new Job(conf);
        job.setJarByClass(ExerciseDriver.class);
        job.setMapperClass(ExerciseMapper3.class);
        job.setReducerClass(ExerciseReducer3.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(pathin));
        FileOutputFormat.setOutputPath(job, new Path(pathout));
        boolean success = job.waitForCompletion(true);
        return success ? 0 : -1;
    }
}

After that, just schedule the jar file in crontab, or else you can also go with Oozie. As mentioned, in the driver class the 3 MapReduce jobs are executed one after the other; the first job's output is the second job's input.
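
As a sketch, a crontab entry for the jar could look like the line below; the jar path, driver class, and HDFS paths are placeholders, and you may need the full path to the hadoop binary because cron runs with a minimal PATH:

0 */3 * * * /usr/bin/hadoop jar /path/to/ExerciseDriver.jar ExerciseDriver /hdfs/input /hdfs/output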
