
How to run/install Oozie on an EMR cluster

I want to orchestrate my EMR jobs, so I thought Oozie would be a good fit. I have done some POCs on Oozie workflows, but only in local mode, where it is fairly simple and works great.

But I don't understand how to use Oozie on an EMR cluster. Based on some searching, I learned that EMR doesn't come with Oozie, so we have to install it explicitly as a bootstrap action. Most people point to this link: https://github.com/lila/emr-oozie-sample

But since I am new to AWS (EMR), I am still confused about how to use it. It would be great if anyone could simplify it for me by providing some steps or something similar.

Thanks

I had a question, which I posted to AWS technical support, and I got the reply below. I tried it, and Oozie was installed and running with no extra effort required.

To have Oozie installed on an EMR cluster, you need to install Hue, because Oozie on EMR is currently installed as a dependency of Hue. Hue is supported on AMIs 3.3.0 and 3.3.1, as per http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html . After launching an EMR cluster with Hue installed (see http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hue.html ), you should be able to use Oozie immediately, as it will already be configured and started.
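One way to confirm that Oozie is actually up after the cluster launches is to query the status endpoint of Oozie's web services API. The following is only a minimal sketch: it assumes Oozie listens on its default port 11000 on the master node and that the port is reachable from where you run the script (for example through an SSH tunnel). The function names and the `master_dns` argument are hypothetical, chosen for illustration.

```python
import json
from urllib.request import urlopen


def oozie_status_url(master_dns, port=11000):
    """Build the URL of Oozie's admin status endpoint.

    Assumption: Oozie runs on the master node on its default port 11000.
    """
    return f"http://{master_dns}:{port}/oozie/v1/admin/status"


def parse_system_mode(body):
    """Extract the 'systemMode' field from the JSON status response."""
    return json.loads(body)["systemMode"]


def oozie_is_running(master_dns, port=11000, timeout=10):
    """Return True if Oozie answers and reports NORMAL system mode."""
    with urlopen(oozie_status_url(master_dns, port), timeout=timeout) as resp:
        return parse_system_mode(resp.read().decode("utf-8")) == "NORMAL"
```

Calling `oozie_is_running("<master public DNS>")` raises an exception if the server cannot be reached, so a connection error and a stopped Oozie service look different to the caller.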

EMR 4.x and 5.x series releases now come with Oozie as an optional application. There is also a post on the AWS Big Data Blog outlining how to get started with it:

https://blogs.aws.amazon.com/bigdata/post/TxZ4KDBGBMZYJL/Use-Apache-Oozie-Workflows-to-Automate-Apache-Spark-Jobs-and-more-on-Amazon-EMR
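As a rough illustration of selecting Oozie as an optional application, the sketch below builds the keyword arguments for boto3's `run_job_flow` call. The release label, instance types, instance count, and cluster name are assumptions chosen for the example, and the role names are the EMR default roles, which must already exist in your account. The helper `build_cluster_request` is hypothetical; only the dictionary keys come from the EMR API.

```python
def build_cluster_request(name, release_label="emr-5.36.0", instance_count=3):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Assumptions: the release label, instance types and default role
    names below are illustrative; adjust them for your account.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Oozie is requested like any other optional application.
        "Applications": [{"Name": "Hadoop"}, {"Name": "Oozie"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": instance_count,
            # Keep the cluster alive so workflows can be submitted later.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request("oozie-demo")

# To actually launch the cluster (requires boto3 and AWS credentials):
#   import boto3
#   response = boto3.client("emr").run_job_flow(**request)
#   print(response["JobFlowId"])
```

The request is built separately from the API call so it can be inspected or logged before any cluster is created.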

That GitHub project installs Oozie as well, so you don't need to take care of it yourself. The configuration for the Oozie installation is at the following link:

https://github.com/lila/emr-oozie-sample/blob/master/config/config-oozie.sh

After that, there are some tasks you can execute from the command shell: create, ssh, sshproxy, and socksproxy.

So, if you follow the author's instructions, you only need to run some of these tasks to create and execute an EMR task using Oozie.

For those who are interested, I have cloned the repo and updated the Oozie installer script to support Hadoop 2.4.0 and Oozie 4.0.1:

https://github.com/davideanastasia/emr-oozie-sample

Firstly, this is not a direct answer to this question.

EMR integrates with Data Pipeline, Amazon's own scheduler and data workflow orchestrator. Amazon expects you to use Data Pipeline with EMR. It can create, start, and terminate EMR clusters and manage the cluster lifecycle. Evaluate it to see whether it fits your needs better.
