
Using Jupyter notebook on Spark on EMR

I am new to Spark and AWS. I am trying to install Jupyter on my Spark cluster (EMR), but in the end I am not able to open the Jupyter Notebook in my browser.

Context: I have firewall issues at the place where I work, so I can't access the IP addresses of the EMR clusters I create on a day-to-day basis. I have a dedicated EC2 instance (its IP address is whitelisted) that I use as a client to connect to the EMR clusters I create as needed.

I have access to the IP address of the EC2 instance and to ports 22 and 8080. I do not have access to the IP address of the EMR cluster.

These are the steps I am following:

  1. Open PuTTY and connect to the EC2 instance.
  2. Establish a connection between my EC2 instance and the EMR cluster: ssh -i publickey.pem ec2-user@<host name of the EMR cluster>
  3. Install Jupyter on the Spark cluster using the following command: pip install jupyter

  4. Connect to Spark: PYSPARK_DRIVER_PYTHON=/usr/local/bin/jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777" pyspark --packages com.databricks:spark-csv_2.10:1.1.0 --master spark://127.0.0.1:7077 --executor-memory 6400M --driver-memory 6400M

  5. Establish a tunnel to the browser: ssh -L 0.0.0.0:8080:127.0.0.1:7777 ip-172-31-34-209 -i publickey.pem

  6. Open Jupyter in the browser:

http://<host name of the EMR cluster>:8080

I am able to run the first 5 steps, but I am not able to open the Jupyter notebook in my browser.

I didn't test this, as it would involve setting up a test EMR server, but here's what should work:

Step 5:

ssh -i publickey.pem -L 8080:127.0.0.1:7777 HOSTNAME

Step 6:

Open the Jupyter notebook in the browser at 127.0.0.1:8080
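
As a rough sketch of how the forwarding spec in step 5 is put together, the snippet below prints the ssh command instead of running it, so the ports can be checked before opening the tunnel. The tunnel_cmd helper is hypothetical and the -N flag (forward only, no remote shell) is my own addition, not part of the command above. It assumes the command is run on the machine where the browser runs and that HOSTNAME is reachable from there.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original answer): builds the ssh
# port-forwarding command and prints it rather than executing it.
#   $1 = private key file
#   $2 = local port the browser will connect to
#   $3 = port the notebook listens on, on the remote host's loopback
#   $4 = host to ssh into
tunnel_cmd() {
    key="$1"
    local_port="$2"
    remote_port="$3"
    host="$4"
    # -N is an assumption: it keeps the session open for forwarding only.
    printf 'ssh -i %s -N -L %s:127.0.0.1:%s %s\n' \
        "$key" "$local_port" "$remote_port" "$host"
}

# Dry run with the values used in steps 4 and 5.
tunnel_cmd publickey.pem 8080 7777 HOSTNAME
```

With these values, local port 8080 on the machine running ssh maps to port 7777 on HOSTNAME's loopback interface, which is the port passed to the notebook in step 4.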

You can use an EMR notebook with Amazon EMR clusters running Apache Spark to remotely run queries and code. An EMR notebook is a "serverless" Jupyter notebook. It sits outside the cluster and takes care of cluster attachment, so you don't have to worry about it.

More information here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks.html
