
Jupyter + EMR + Spark - Connect to EMR cluster from Jupyter notebook on local machine

I am new to PySpark and EMR.
I am trying to access Spark running on an EMR cluster from a Jupyter notebook, but I am running into errors.

I create the SparkSession with the following code:

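from pyspark.sql import SparkSession

# Local session: runs Spark on this machine using all available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("parallelization on Spark") \
    .getOrCreate()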

I tried the following to access the remote cluster, but it failed:

spark = SparkSession.builder \
    .master("spark://<remote-emr-ec2-hostname>:7077")\
    .appName("parallelization on Spark")\
    .getOrCreate()

Error:

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:567)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

Any help resolving this would be much appreciated.

EMR clusters have had Jupyter and JupyterHub provisioned for you since EMR version 5.14.0.

It is most likely easier to tune those provisioned services with some extra bootstrap actions than to wire up a local process to talk to the EMR master node.
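If you still want to drive the cluster from a local notebook, one possible route is Apache Livy's REST API instead of a `spark://` master URL. EMR normally runs Spark on YARN rather than as a standalone master, so port 7077 is probably not listening at all. The sketch below is not part of the original answer and rests on several assumptions: Livy is installed on the cluster, its default port 8998 is reachable from your machine (for example through an SSH tunnel to the master node), and `LIVY_URL` and the helper names are hypothetical.

```python
import json
import time
import urllib.request

# Assumption: Livy on the EMR master node is reachable here, e.g. through an
# SSH tunnel that forwards local port 8998 (Livy's default port).
LIVY_URL = "http://localhost:8998"


def session_request(kind="pyspark"):
    """Body for POST /sessions, asking Livy to start an interactive session."""
    return {"kind": kind}


def statement_request(code):
    """Body for POST /sessions/{id}/statements, running code in a session."""
    return {"code": code}


def build_post(url, body):
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(request_or_url):
    """Send a request (or GET a URL) and decode the JSON response."""
    with urllib.request.urlopen(request_or_url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for(url, wanted, poll_seconds=2.0):
    """Poll a Livy resource until its 'state' field equals `wanted`."""
    while True:
        info = fetch_json(url)
        if info["state"] == wanted:
            return info
        if info["state"] in ("error", "dead", "killed"):
            raise RuntimeError("Livy reported state: " + info["state"])
        time.sleep(poll_seconds)


def run_on_cluster(code, livy_url=LIVY_URL):
    """Start a PySpark session, run one statement, and return its output."""
    session = fetch_json(build_post(livy_url + "/sessions", session_request()))
    session_url = "{}/sessions/{}".format(livy_url, session["id"])
    wait_for(session_url, "idle")  # the session must be idle before it accepts code

    stmt = fetch_json(build_post(session_url + "/statements", statement_request(code)))
    stmt_url = "{}/statements/{}".format(session_url, stmt["id"])
    return wait_for(stmt_url, "available")["output"]
```

With a tunnel in place, a call such as `run_on_cluster("spark.range(10).count()")` would submit that line to the cluster and return Livy's output record. The session is left running afterwards, so a real notebook would want to reuse it across cells and delete it when finished.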
