How do I run a local Python script on a remote Spark cluster?

I have a Python script in a local Jupyter notebook that runs jobs on the Spark cluster running on my machine:

import pyspark

# Spark context pointed at the locally configured master
sc = pyspark.SparkContext(appName="test")
sqlCtx = pyspark.SQLContext(sc)

How do I change this connection so that the jobs run on my EMR Spark cluster in AWS instead?

Is this possible, or do I have to SSH into the remote cluster and use the spark-submit command?

You have to use spark-submit. I don't believe you can connect your local script to the EMR cluster, because the master node would need to be local.

Here is a similar post that may be helpful: How to connect to Spark EMR from the locally running Spark Shell. However, adding the Spark job as an EMR step is just another way of submitting code, useful if you want to run it repeatedly.
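If you go the EMR-step route, the step can be submitted from your local machine with boto3 instead of SSH'ing in. A minimal sketch, assuming the script has already been uploaded to S3; the cluster ID, region, and S3 path below are placeholders:

import boto3

# Placeholder values -- replace with your own cluster ID and script location
CLUSTER_ID = "j-XXXXXXXXXXXXX"
SCRIPT_S3_PATH = "s3://my-bucket/jobs/test_job.py"

emr = boto3.client("emr", region_name="us-east-1")

# command-runner.jar lets an EMR step run an arbitrary command on the master
# node; here it invokes spark-submit against the script stored in S3.
response = emr.add_job_flow_steps(
    JobFlowId=CLUSTER_ID,
    Steps=[
        {
            "Name": "test",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", SCRIPT_S3_PATH],
            },
        }
    ],
)

print(response["StepIds"])  # step ID(s) you can track in the EMR console

The step then appears in the EMR console like any other, so it can be cloned or re-run for repeated jobs.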

If your goal is to use a Jupyter notebook on top of your EMR cluster, see: https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/

If you want to use a Jupyter notebook and run your code on a remote EMR cluster, you can also use an EMR Notebook.

More information here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks.html
