
AWS EMR Spark 1.0


Is there a way to force Amazon EMR to use Spark 1.0.1? The current selectable versions stop at 1.4.1.

I am using the Alternating Least Squares implementation in MLlib. Since v1.1 it has used weighted regularization, and for specific reasons (a research study) I do not want that implementation; instead, I am trying to access the non-weighted regularization version implemented in v1.0.

I am using Zeppelin notebooks with Scala, if that helps.

Is working with Zeppelin a requirement? If so, this could be very difficult: Zeppelin is compiled against a specific version of Spark, so downgrading the jar will most likely fail.

Otherwise, if you are OK with skipping Zeppelin and using the EMR Step API instead, you might be able to spin up an EMR cluster with a bootstrap action that installs spark-assembly 1.0.1. I say it *might* work, because there is no guarantee that the current EMR version is compatible with a two-year-old version of Spark.

To create the cluster:
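One way this could look, as a hedged sketch: the install-spark script from the awslabs/emr-bootstrap-actions repo accepted a version argument on the 3.x AMIs. The script path, its -v flag, the AMI version, and the instance settings below are all assumptions modeled on those examples, not verified against AWS docs. The snippet builds the command as a string and prints it so you can review it before running:

```shell
# Hedged sketch only: install-spark path, -v flag, AMI version, and instance
# settings are assumptions based on the awslabs/emr-bootstrap-actions examples.
# Replace the <placeholders>, review the printed command, then run it yourself.
CREATE_CMD='aws emr create-cluster \
  --name "spark-1.0.1" \
  --ami-version 3.11.0 \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=<your-key> \
  --use-default-roles \
  --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=["-v","1.0.1"]'
echo "$CREATE_CMD"
```

The bootstrap action runs on every node before the cluster is marked ready, so if the requested Spark version is incompatible with the AMI, the cluster will fail during provisioning rather than at job time.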

To run Spark using the EMR Step API:

  • Upload your compiled jar to S3, then submit a step against that cluster
  • Cluster ID: the ID of your cluster (e.g. j-XXXXXXXX)
  • Region of the cluster: where you created your EMR cluster, e.g. us-west-2
  • Your Spark main class: this is where you put your ML pipeline code
  • Your jar: upload the jar with your code to S3 so your cluster can download it
  • arg1, arg2: arguments to your main (optional)

aws emr add-steps --cluster-id <cluster-id> --steps \
Name=SparkPi,Jar=s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn,--class,com.your.spark.class.MainApp,s3://<your-bucket>/your.jar,arg1,arg2],ActionOnFailure=CONTINUE

(Taken from the official GitHub repo at https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/spark-submit-via-step.md)

If that fails too, install Hadoop yourself and check out https://spark.apache.org/docs/1.0.1/running-on-yarn.html

Or you could also run 1.0.1 locally on your laptop if your data is small.

Good luck.

Amazon EMR provides a list of supported software package versions that you can install by selecting from a drop-down menu. Nothing stops you from installing additional custom software with a bootstrap action. I had some experience installing Java 8 back when EMR supported only Java 7; it was a bit painful, but entirely possible.
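To make "installing additional custom software" concrete, here is a minimal bootstrap-script sketch for pulling in a custom Spark build. The tarball name, archive URL, and install directory are assumptions, and the download/extract lines are left commented so the sketch runs without network access:

```shell
#!/bin/bash
# Illustrative bootstrap action (hypothetical paths and artifact names):
# fetch a prebuilt Spark 1.0.1 and unpack it on each node. Upload this
# script to S3 and reference it with:
#   --bootstrap-actions Path=s3://<your-bucket>/install-spark-1.0.1.sh
set -e

SPARK_VERSION="1.0.1"
SPARK_TARBALL="spark-${SPARK_VERSION}-bin-hadoop2.tgz"   # assumed artifact name
INSTALL_DIR="/home/hadoop"                               # hadoop user's home on 3.x AMIs

echo "Would install Spark ${SPARK_VERSION} into ${INSTALL_DIR}"
# wget "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_TARBALL}"
# tar -xzf "${SPARK_TARBALL}" -C "${INSTALL_DIR}"
# ln -sfn "${INSTALL_DIR}/spark-${SPARK_VERSION}-bin-hadoop2" "${INSTALL_DIR}/spark"
```

Bootstrap actions run as the hadoop user before any steps execute, so a script like this is also where you would adjust PATH or environment files if your jobs expect the custom build to be the default.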

EMR supports Spark 1.6.0. Take a look at their latest release, emr-4.4.0: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-whatsnew.html
