Jupyter + EMR + Spark - Connect to EMR cluster from Jupyter notebook on local machine
I am new to PySpark and EMR.
I am trying to access Spark running on an EMR cluster from a Jupyter notebook on my local machine, but I am running into errors.
I create the SparkSession with the following code:
from pyspark.sql import SparkSession

# Works fine in local mode
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("parallelization on Spark") \
    .getOrCreate()
I then tried the following to connect to the remote cluster, but it errored out:
spark = SparkSession.builder \
    .master("spark://<remote-emr-ec2-hostname>:7077") \
    .appName("parallelization on Spark") \
    .getOrCreate()
Error:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
at org.apache.spark.SparkContext.<init>(SparkContext.scala:567)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
Any help resolving this would be much appreciated.
EMR clusters have come with Jupyter and JupyterHub provisioned for you since EMR release 5.14.0.
Most likely, it is easier to tune those provisioned services with some extra bootstrap actions than to wire up your local process to talk to the EMR master node.
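If you do still want to drive the cluster from a local notebook, note that EMR runs Spark on YARN, not as a standalone cluster, so there is no standalone master listening on port 7077 to point `spark://` at. A common alternative (an assumption on my part, not something the question used) is to install Apache Livy on the cluster, which exposes a REST endpoint on port 8998 by default, and connect the local notebook through the sparkmagic kernels. A minimal `~/.sparkmagic/config.json` sketch, where `<emr-master-public-dns>` is a placeholder for your master node's address and the master's security group must allow you to reach port 8998:

```json
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://<emr-master-public-dns>:8998",
    "auth": "None"
  }
}
```

With that in place, `pip install sparkmagic`, register its kernels with Jupyter, and open a PySpark notebook locally; the code then executes on the cluster via Livy rather than in a local SparkContext.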