No module named 'pyspark' when running Jupyter notebook inside EMR
I am (very) new to AWS and Spark in general, and I'm trying to run a notebook instance in Amazon EMR. When I try to import pyspark to start a session and load data from S3, I get the error No module named 'pyspark'. The cluster I created had the Spark option filled, so what am I doing wrong?
The only solution that worked for me was to change the notebook kernel to the PySpark kernel, then change the bootstrap action to install the packages (under Python 3.6) that are not included by default in the PySpark kernel:
#!/bin/bash
sudo python3.6 -m pip install numpy \
matplotlib \
pandas \
seaborn \
pyspark
Apparently the packages are installed to Python 2.7.16 by default, so the installation itself reports no error, but you cannot import the modules from the notebook because the Spark environment resolves to a different interpreter; hence the explicit python3.6 in the bootstrap script above.
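A quick way to diagnose this mismatch from inside the notebook is to check which interpreter the kernel is actually running and whether the installed packages are visible to it (a minimal sketch, nothing EMR-specific assumed):

```python
import importlib.util
import sys

# Print the interpreter this kernel resolves to; on EMR it should be
# the same Python 3 interpreter the bootstrap action targeted.
print(sys.executable)
print(sys.version_info)

# Check whether the bootstrapped packages are importable from THIS interpreter.
for name in ("numpy", "pandas", "pyspark"):
    found = importlib.util.find_spec(name) is not None
    print(f"{name}: {'found' if found else 'NOT found'}")
```

If `pyspark` shows as NOT found here, pip installed it for a different Python than the one the kernel is using.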
You can open a JupyterLab notebook and select a new Spark notebook from there. This will initialize the Spark context for you automatically.
Or you can open a Jupyter notebook and load the Spark application with the %%spark magic.
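For example, with the sparkmagic extension that EMR notebooks ship with, running a cell prefixed with %%spark executes against the cluster-side session (the S3 path below is purely a placeholder, not from the question):

```
%%spark
df = spark.read.csv("s3://my-bucket/data.csv", header=True)
df.show(5)
```

The first %%spark cell starts the remote Livy session if one is not already running.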
You could try using the findspark library. pip install findspark, then run the code below in your Jupyter notebook:
import findspark
findspark.init()
%load_ext sparksql_magic
%config SparkSql.limit=200
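With the magic loaded, a cell can then query Spark directly with SQL (this assumes the sparksql_magic package loaded above; the table name is just a placeholder):

```
%%sparksql
SELECT * FROM my_table LIMIT 10
```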