
How should you run a jupyter notebook on a Spark EMR cluster?

EDIT: This question was about how to define parameters for a python/jupyter-notebook file in order to spark-submit it on an Amazon EMR Spark cluster...

Before: I am sorry for my dumb questions, but I am a newbie and have been stuck on this issue for a couple of days, and there seems to be no good guide on the web. I am following the Udacity Spark course. I have created a Spark YARN cluster on Amazon AWS (EMR), with one master and 3 slaves. I created a Jupyter notebook on top of that (and was able to run it and see output using the PySpark kernel). I connected to the cluster (I guess to the master node) using PuTTY, and downloaded the Jupyter notebook to the local machine. However, when I try to run it, I consistently hit many types of errors. Currently, I run these commands:

/usr/bin/spark-submit --class "org.apache.spark.examples.SparkPi" --master yarn --deploy-mode cluster ./my-test-emr.ipynb 1>output-my-test-emr.log 2>error-my-test-emr.log
aws s3 cp ./error-my-test-emr.log s3://aws-emr-resources-750982214328-us-east-2/notebooks/e-8TP55R4K894W1BFRTNHUGJ90N/error-my-test-emr.log

I made both the error file and the Jupyter notebook public so you can see them (link). I truly suspect the --class parameter (I pretty much guessed at it, and I have read about it as an option for my problem, but no further information was given). Can anyone explain what it is? Why do we need it? And how can I find out/set the right value? If anyone is willing, a further explanation about JARs would also be helpful: why should I turn my Python program into Java, and how should I do that? Many questions have been asked here about this, but none explains it from the root...

Thanks in advance

  1. Export your notebook as a .py file (e.g. with jupyter nbconvert --to script my-test-emr.ipynb).
  2. You do not need to specify --class for a Python script.
  3. You do not need to convert your Python code to Java/Scala.
  4. Once you have your .py file with some name, say test.py, this will work:
spark-submit --master yarn --deploy-mode cluster ./test.py
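To make the steps above concrete, here is a minimal sketch of what a self-contained PySpark script (saved as test.py) can look like; the app name and the even-number job are illustrative examples, not from the original notebook:

```python
# Minimal, self-contained PySpark script suitable for spark-submit.
# The helper is kept as a plain function so the core logic is testable
# without a cluster.

def is_even(n: int) -> bool:
    """Return True if n is even."""
    return n % 2 == 0

def run(spark) -> int:
    """Count the even numbers in 0..99 using a Spark RDD."""
    return spark.sparkContext.parallelize(range(100)).filter(is_even).count()

if __name__ == "__main__":
    try:
        from pyspark.sql import SparkSession  # provided on EMR nodes
    except ImportError:
        SparkSession = None  # running outside a Spark environment

    if SparkSession is not None:
        spark = SparkSession.builder.appName("my-test-emr").getOrCreate()
        print(run(spark))  # 50 even numbers in 0..99
        spark.stop()
```

With the file in place, spark-submit --master yarn --deploy-mode cluster ./test.py runs it on the cluster; no --class and no JAR are involved for a Python script.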

When you say locally, what version of Spark did you download, and from where?

Generally, when I configure Spark on my laptop, I just run the command below to run the Spark Pi example:

spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode client $SPARK_HOME/lib/spark-examples.jar 10

Where SPARK_HOME is the folder where you extracted the tarball from the Spark website.
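The location of the examples JAR varies by Spark version (older tarballs ship lib/spark-examples.jar, newer ones examples/jars/spark-examples_*.jar), so a small helper can assemble the SparkPi command for whichever layout is present; the paths and helper name here are assumptions for illustration, not part of the original answer:

```python
import glob
import os

def sparkpi_command(spark_home: str, partitions: int = 10) -> str:
    """Build the spark-submit command for the bundled SparkPi example.

    Tries the newer examples/jars/ layout first, then falls back to the
    older lib/spark-examples.jar path.
    """
    candidates = glob.glob(
        os.path.join(spark_home, "examples", "jars", "spark-examples*.jar"))
    jar = candidates[0] if candidates else os.path.join(
        spark_home, "lib", "spark-examples.jar")
    return ("spark-submit --class org.apache.spark.examples.SparkPi "
            f"--master yarn --deploy-mode client {jar} {partitions}")
```

For example, sparkpi_command("/opt/spark") yields the same command as above with the JAR path filled in for that install.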


 