
Can I run a job on EMR like on my local cluster?

I have built a local Hadoop cluster on my laptop (pseudo-distributed mode), where I run MapReduce commands like:

hadoop-streaming -D mapred.output.compress=true \
   -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
   -files my_mapper.py,my_reducer.py \
  -mapper my_mapper.py  \
  -reducer my_reducer.py \
  -input /aws/input/input_warc.txt \
  -output /aws/output

Now I have to run it on EMR. There are two options: the console and the AWS CLI. I want to run exactly the same command as above. For that, I think that if I SSH to the EMR master node, I should be able to run this command. Is that the right way, or are there drawbacks to this approach?

Yes, you may SSH to your cluster and run your jobs there, but you may also use the Step API ( http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-steps.html ) to run arbitrary commands on the master instance, including, of course, distributed jobs like your example. You may add steps to a cluster using the AWS CLI ("aws emr add-steps ...", or during cluster creation with "aws emr create-cluster ... --steps ..."), using the AWS SDKs (such as the AWS Java SDK), or through the AWS EMR Console.
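As a rough illustration (the cluster ID, bucket names, and S3 paths below are placeholders, not values from your question), a streaming step equivalent to your local command could be submitted with the AWS CLI like this, using a JSON step definition so the comma-separated -files list survives argument parsing:

aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps file://./streaming_step.json

where streaming_step.json contains:

[
  {
    "Type": "STREAMING",
    "Name": "my_streaming_job",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "-D", "mapred.output.compress=true",
      "-D", "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
      "-files", "s3://my-bucket/scripts/my_mapper.py,s3://my-bucket/scripts/my_reducer.py",
      "-mapper", "my_mapper.py",
      "-reducer", "my_reducer.py",
      "-input", "s3://my-bucket/input/input_warc.txt",
      "-output", "s3://my-bucket/output"
    ]
  }
]

Note that on EMR the input and output are usually S3 (or HDFS) paths rather than the local /aws/... paths from your pseudo-distributed setup.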

Advantages of the Step API include that it captures the output of each step, so you can view it via the AWS CLI, an SDK, or the AWS Console, and that you can check the status of steps to determine when they have completed.
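For example (again with placeholder cluster and step IDs), step state and details can be checked from the CLI, and the step's stdout/stderr typically end up under the cluster's configured S3 log URI:

# List all steps on the cluster with their current state (PENDING, RUNNING, COMPLETED, FAILED, ...)
aws emr list-steps --cluster-id j-2AXXXXXXGAPLF

# Show detailed status for a single step
aws emr describe-step --cluster-id j-2AXXXXXXGAPLF --step-id s-3XXXXXXXXXXXXX

# If the cluster was created with --log-uri, step logs are usually written under
#   s3://<log-bucket>/<cluster-id>/steps/<step-id>/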

One disadvantage of the Step API is that currently Steps all run sequentially, so you can't have multiple Steps running in parallel.

