Can I run a job on EMR like on my local cluster?

I have built a local cluster on my laptop (pseudo-distributed mode), where I run various MapReduce commands like this one:

hadoop-streaming -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -files my_mapper.py,my_reducer.py \
  -mapper my_mapper.py \
  -reducer my_reducer.py \
  -input /aws/input/input_warc.txt \
  -output /aws/output

Now I have to run it on EMR. There are two options: the console and the AWS CLI. I want to run commands exactly like the one above. I think that if I SSH to the EMR master, I should be able to run this command. Is this the right way, or does this approach have any drawbacks?

Yes, you may SSH to your cluster and run your jobs there, but you may also use the Step API ( http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-steps.html ) to run arbitrary commands on the master instance, including distributed jobs like your example. You may add Steps to a cluster using the AWS CLI ("aws emr add-steps ...", or during cluster creation using "aws emr create-cluster ... --steps ..."), using the AWS SDKs (like the AWS Java SDK), or using the AWS EMR Console.
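As a rough illustration of the SDK route, the sketch below builds a step definition for the streaming job from the question in the shape that the EMR AddJobFlowSteps operation accepts. It assumes an EMR release where command-runner.jar is available and can invoke hadoop-streaming; the step name "warc-streaming", the helper build_streaming_step, and the cluster ID placeholder are hypothetical. The -files paths are copied from the question and would need to be reachable from the cluster (for example as S3 URIs), which is an assumption to verify for your setup.

```python
import json


def build_streaming_step(name, hadoop_args, on_failure="CONTINUE"):
    """Return a step definition dict for a hadoop-streaming job.

    Hypothetical helper: it only assembles the dict and does not call AWS.
    """
    return {
        "Name": name,
        "ActionOnFailure": on_failure,
        "HadoopJarStep": {
            # Assumes command-runner.jar is available on the cluster.
            "Jar": "command-runner.jar",
            "Args": ["hadoop-streaming", *hadoop_args],
        },
    }


# Same arguments as the command in the question, one list item per token.
step = build_streaming_step(
    "warc-streaming",  # hypothetical step name
    [
        "-D", "mapred.output.compress=true",
        "-D", "mapred.output.compression.codec="
              "org.apache.hadoop.io.compress.GzipCodec",
        "-files", "my_mapper.py,my_reducer.py",
        "-mapper", "my_mapper.py",
        "-reducer", "my_reducer.py",
        "-input", "/aws/input/input_warc.txt",
        "-output", "/aws/output",
    ],
)

print(json.dumps(step, indent=2))

# Submitting it would look roughly like this (requires boto3 and credentials,
# so it is left commented out; the cluster ID is a placeholder):
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

Keeping the step as plain data makes it easy to inspect before submitting it to a running cluster.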

One advantage of the Step API is that it captures the output of each Step, so you can view it via the AWS CLI, the SDKs, or the AWS Console. You can also check the status of Steps to determine when they have completed.
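Checking a Step's status until it finishes could be sketched as a small polling loop. The function wait_for_step and its get_state callback are hypothetical names; in real use the callback would wrap a call such as boto3's describe_step, while here it is fed canned states so the example runs without AWS access.

```python
import time

# States after which a Step no longer changes.
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}


def wait_for_step(get_state, poll_seconds=30, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state, then return it."""
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)


# Demo with canned states instead of a real cluster; sleeping is skipped.
_states = iter(["PENDING", "RUNNING", "RUNNING", "COMPLETED"])
final_state = wait_for_step(lambda: next(_states), sleep=lambda seconds: None)
print(final_state)

# With boto3, get_state might look like this (IDs are placeholders):
# import boto3
# emr = boto3.client("emr")
# get_state = lambda: emr.describe_step(
#     ClusterId="j-XXXXXXXXXXXXX", StepId="s-XXXXXXXXXXXXX"
# )["Step"]["Status"]["State"]
```

Passing the state lookup in as a callback keeps the loop independent of any particular SDK client.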

One disadvantage of the Step API is that, at the time of writing, all Steps run sequentially, so you can't have multiple Steps running in parallel.
