
How should you run a jupyter notebook on a Spark EMR cluster?

EDIT: This question was about how to define parameters for a python/jupyter-notebook file in order to spark-submit it on an Amazon EMR Spark cluster...

Before: I am sorry for my dumb questions, but I am a newbie and have been stuck on this issue for a couple of days, and there seems to be no good guide on the web. I am following the Udacity Spark course. I have created a Spark YARN cluster on Amazon AWS (EMR), with one master and 3 slaves. I created a Jupyter notebook on top of that (and was able to run it and see output using the PySpark kernel). I connected to the cluster (I guess to the master node) using PuTTY, and downloaded the Jupyter notebook to the local machine. However, when I try to run it, I consistently hit many types of errors. Currently, I run these commands:

/usr/bin/spark-submit --class "org.apache.spark.examples.SparkPi" --master yarn --deploy-mode cluster ./my-test-emr.ipynb 1>output-my-test-emr.log 2>error-my-test-emr.log
aws s3 cp ./error-my-test-emr.log s3://aws-emr-resources-750982214328-us-east-2/notebooks/e-8TP55R4K894W1BFRTNHUGJ90N/error-my-test-emr.log

I made both the error file and the Jupyter notebook public so you can see them (link). I truly suspect the --class parameter (I pretty much guessed at it, and I have read about it as an option for my problem, but no further information was given). Can anyone explain what it is? Why do we need it? And how can I find out/set the right value? If anyone is willing, a further explanation about JARs would also be helpful: why should I turn my Python program into Java, and how should I do that? Many questions have been asked here about this, but none explains it from the root...

Thanks in advance

  1. Export your notebook as a .py file (e.g. with jupyter nbconvert --to script my-test-emr.ipynb).
  2. You do not need to specify --class for a Python script.
  3. You do not need to convert your Python code to Java/Scala.
  4. Once you have your .py file with some name, say test.py, this will work:
spark-submit --master yarn --deploy-mode cluster ./test.py
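To make the steps above concrete, here is a minimal sketch of what a self-contained PySpark script (saved as test.py) can look like; the app name and the even-number job are illustrative examples, not from the original notebook:

```python
# Minimal, self-contained PySpark script suitable for spark-submit.
# The helper is kept as a plain function so the core logic is testable
# without a cluster.

def is_even(n: int) -> bool:
    """Return True if n is even."""
    return n % 2 == 0

def run(spark) -> int:
    """Count the even numbers in 0..99 using a Spark RDD."""
    return spark.sparkContext.parallelize(range(100)).filter(is_even).count()

if __name__ == "__main__":
    try:
        from pyspark.sql import SparkSession  # provided on EMR nodes
    except ImportError:
        SparkSession = None  # running outside a Spark environment

    if SparkSession is not None:
        spark = SparkSession.builder.appName("my-test-emr").getOrCreate()
        print(run(spark))  # 50 even numbers in 0..99
        spark.stop()
```

With the file in place, spark-submit --master yarn --deploy-mode cluster ./test.py runs it on the cluster; no --class and no JAR are involved for a Python script.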

When you say locally, what version of Spark did you download, and from where?

Generally, when I configure Spark on my laptop, I just run the command below to run the Spark Pi example:

spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode client $SPARK_HOME/lib/spark-examples.jar 10

Where SPARK_HOME is the folder where you extracted the tarball from the Spark website.
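The location of the examples JAR varies by Spark version (older tarballs ship lib/spark-examples.jar, newer ones examples/jars/spark-examples_*.jar), so a small helper can assemble the SparkPi command for whichever layout is present; the paths and helper name here are assumptions for illustration, not part of the original answer:

```python
import glob
import os

def sparkpi_command(spark_home: str, partitions: int = 10) -> str:
    """Build the spark-submit command for the bundled SparkPi example.

    Tries the newer examples/jars/ layout first, then falls back to the
    older lib/spark-examples.jar path.
    """
    candidates = glob.glob(
        os.path.join(spark_home, "examples", "jars", "spark-examples*.jar"))
    jar = candidates[0] if candidates else os.path.join(
        spark_home, "lib", "spark-examples.jar")
    return ("spark-submit --class org.apache.spark.examples.SparkPi "
            f"--master yarn --deploy-mode client {jar} {partitions}")
```

For example, sparkpi_command("/opt/spark") yields the same command as above with the JAR path filled in for that install.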


 