
How should you run a Jupyter notebook on a Spark EMR cluster

EDIT: This question was about how to define the parameters for a Python / Jupyter-notebook file in order to spark-submit it on an Amazon EMR Spark cluster...

Before: I am sorry for the dumb questions, but I am a complete newbie, I have been stuck on this issue for a couple of days, and there seems to be no good guide on the web. I am following the Udacity Spark course. I have created a Spark YARN cluster on Amazon AWS (EMR), with one master and three slaves. I created a Jupyter notebook on top of it (and was able to run it and see output using the PySpark kernel). I connected to the cluster (to the master node, I guess) using PuTTY and downloaded the Jupyter notebook to the local machine. However, when I try to run it, I consistently hit many types of errors. Currently, I run these commands:

/usr/bin/spark-submit --class "org.apache.spark.examples.SparkPi" --master yarn --deploy-mode cluster ./my-test-emr.ipynb 1>output-my-test-emr.log 2>error-my-test-emr.log
aws s3 cp ./error-my-test-emr.log s3://aws-emr-resources-750982214328-us-east-2/notebooks/e-8TP55R4K894W1BFRTNHUGJ90N/error-my-test-emr.log

I made both the error file and the Jupyter notebook public so you can see them (link). I truly suspect the --class parameter (I pretty much guessed it, and I have read about it as an option for my troubles, but no further explanation was given). Can anyone explain what it is? Why do we need it? And how can I find out or set the correct value? Further explanation about JARs would also be helpful: why should I turn my Python program into Java, and how would I do that? Many similar questions have been asked here, but none explains it from the root...

Thanks in advance.

  1. Export your notebook as a .py file (a minimal sketch follows after this list).
  2. You do not need to specify --class for a Python script.
  3. You do not need to convert your Python code to Java/Scala.
  4. Once you have your .py file, with some name, say test.py , this will work:
spark-submit --master yarn --deploy-mode cluster ./test.py
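
For reference, a minimal sketch of the export step, assuming the notebook is named my-test-emr.ipynb as in the question: nbconvert (bundled with Jupyter) writes the notebook's code cells out as a plain Python script.

jupyter nbconvert --to script my-test-emr.ipynb    # produces my-test-emr.py

And if you want to sanity-check the cluster first, here is a minimal self-contained PySpark script you could submit the same way (the app name and row count are illustrative, not from the original question):

# test.py - minimal PySpark sanity check (illustrative)
from pyspark.sql import SparkSession

# spark-submit supplies the YARN configuration on EMR, so no master URL is hard-coded here
spark = SparkSession.builder.appName("emr-smoke-test").getOrCreate()

df = spark.range(100)   # trivial DataFrame with 100 rows
print(df.count())       # should print 100
spark.stop()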

When you say locally, what version of Spark did you download, and from where?

Generally, when I configure Spark on my laptop, I just run the command below to run the SparkPi example:

spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode client $SPARK_HOME/lib/spark-examples.jar 10

Where SPARK_HOME is the folder where you extracted the tarball from the Spark website. (In recent Spark releases the examples jar is under examples/jars/ rather than lib/.)
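
For example, a sketch assuming the tarball was extracted to /opt/spark (an illustrative path) and a recent Spark 3.x layout:

export SPARK_HOME=/opt/spark
# without a YARN cluster configured locally, local mode is the simplest check
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master 'local[2]' $SPARK_HOME/examples/jars/spark-examples_*.jar 10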
