
How to run MapReduce jobs sequentially, one after another

I am running a single-node cluster and handling time-series data. I have a set of MapReduce jobs that run periodically (triggered by a Quartz CronTrigger) from a client application. For example:

job1: runs every 10 minutes. Priority VERY_HIGH
job2: runs every hour (it takes its input from the output of job1). Priority HIGH
job3: runs every day (it takes its input from the output of job2). Priority NORMAL

.....

Everything works fine. But sometimes multiple jobs are triggered simultaneously; for example, at 00:00 job1, job2, and job3 are all triggered. Even though the job priorities are set, the jobs end up running in parallel because enough map slots are available, so the lower-priority jobs miss some of their input data.

In brief: I need the jobs to execute strictly in FIFO order based on job priority. That means execution should be restricted so that only a single job runs at a time: job1 finishes, then job2 finishes, then job3, and so on.
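
To illustrate, if everything went through one driver, a strictly sequential run would look something like the sketch below (using the newer mapreduce API; the paths are placeholders and the mapper/reducer setup is omitted). The problem is that my jobs are fired independently by Quartz, so I don't have a single driver like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SequentialDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // job1: waitForCompletion(true) blocks until the job finishes,
            // so nothing below can start while job1 is still running.
            Job job1 = Job.getInstance(conf, "job1");
            // ... setMapperClass/setReducerClass etc. for job1 ...
            FileInputFormat.addInputPath(job1, new Path("/data/raw"));      // placeholder path
            FileOutputFormat.setOutputPath(job1, new Path("/data/10min"));  // placeholder path
            if (!job1.waitForCompletion(true)) System.exit(1);

            // job2 reads job1's output, exactly as in my pipeline.
            Job job2 = Job.getInstance(conf, "job2");
            // ... setMapperClass/setReducerClass etc. for job2 ...
            FileInputFormat.addInputPath(job2, new Path("/data/10min"));
            FileOutputFormat.setOutputPath(job2, new Path("/data/hourly"));
            if (!job2.waitForCompletion(true)) System.exit(1);
        }
    }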

I don't know how the Hadoop schedulers can help me here. Please advise.

Try changing these settings to 1:

mapred.tasktracker.map.tasks.maximum = 1
mapred.tasktracker.reduce.tasks.maximum = 1

If you limit the number of map and reduce slots to 1, the next job has to wait for the current one to finish. But as you can see, this is not a good solution, because it also forces every job to run its map and reduce tasks one at a time.
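
These are per-TaskTracker settings on a Hadoop 1.x cluster, so they go in mapred-site.xml on the node and require a TaskTracker restart to take effect. A minimal sketch:

    <!-- mapred-site.xml: limit this TaskTracker to one map slot and one
         reduce slot, so at most one task of each kind runs at a time.
         Note this limits slots, not jobs; the scheduler can still
         interleave tasks from different jobs. -->
    <configuration>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>1</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>1</value>
      </property>
    </configuration>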

Using the Oozie workflow engine would best suit your needs.
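
In Oozie you describe the pipeline as a workflow of actions, and each action starts only after the previous one ends. A rough sketch of a workflow.xml chaining job1 and job2 (the workflow name and the mapper class are placeholders, and job2's configuration is elided):

    <workflow-app name="timeseries-pipeline" xmlns="uri:oozie:workflow:0.4">
      <start to="job1"/>
      <action name="job1">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <name>mapred.mapper.class</name>
              <value>com.example.TenMinuteMapper</value>  <!-- placeholder -->
            </property>
          </configuration>
        </map-reduce>
        <ok to="job2"/>      <!-- job2 starts only after job1 succeeds -->
        <error to="fail"/>
      </action>
      <action name="job2">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <!-- ... job2 configuration ... -->
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Pipeline failed at [${wf:lastErrorNode()}]</message>
      </kill>
      <end name="end"/>
    </workflow-app>

An Oozie coordinator can then take over the periodic scheduling from your Quartz triggers.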

I've been working on a new workflow engine called Soop: https://github.com/radixCSgeek/soop. It is very lightweight and simple to set up and run, using a cron-like syntax. You can specify job dependencies (including virtual dependencies between jobs), and the DAG engine will make sure to execute them in the right order.
