Amazon EMR Pyspark Module not found
I created an Amazon EMR cluster with Spark already on it. When I ssh into my cluster and run pyspark from the terminal, it opens the pyspark shell.

I uploaded a file using scp, and when I try to run it with python FileName.py, I get an import error:
from pyspark import SparkContext
ImportError: No module named pyspark
How do I fix this?
I added the following lines to ~/.bashrc for emr 4.3:
export SPARK_HOME=/usr/lib/spark
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.XXX-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
Here py4j-0.XXX-src.zip is the py4j file in your Spark Python library folder. Search /usr/lib/spark/python/lib/ to find the exact version and replace the XXX with that version number.
Run source ~/.bashrc and you should be good.
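If you prefer not to hard-code the py4j version, you can also locate the zip programmatically from Python before importing pyspark. A minimal sketch (the helper name add_pyspark_to_path is illustrative, not from the original answer; the default spark_home matches the EMR 4.3 path above):

import glob
import os
import sys

def add_pyspark_to_path(spark_home="/usr/lib/spark"):
    # Glob for the bundled py4j zip so the exact version
    # does not need to be hard-coded.
    lib_dir = os.path.join(spark_home, "python", "lib")
    matches = glob.glob(os.path.join(lib_dir, "py4j-*-src.zip"))
    if not matches:
        raise RuntimeError("no py4j zip found under " + lib_dir)
    # Prepend the py4j zip and Spark's Python sources to sys.path.
    sys.path.insert(1, matches[0])
    sys.path.insert(1, os.path.join(spark_home, "python"))

Call add_pyspark_to_path() once at startup, and from pyspark import SparkContext should then resolve.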
You probably need to add the pyspark files to the path. I typically use a function like the following.
import os
import sys

def configure_spark(spark_home=None, pyspark_python=None):
    spark_home = spark_home or "/path/to/default/spark/home"
    os.environ['SPARK_HOME'] = spark_home
    # Add the PySpark directories to the Python path:
    sys.path.insert(1, os.path.join(spark_home, 'python'))
    sys.path.insert(1, os.path.join(spark_home, 'python', 'pyspark'))
    sys.path.insert(1, os.path.join(spark_home, 'python', 'build'))
    # If PYSPARK_PYTHON isn't specified, use the currently running Python binary:
    pyspark_python = pyspark_python or sys.executable
    os.environ['PYSPARK_PYTHON'] = pyspark_python
Then, you can call the function before importing pyspark:
configure_spark('/path/to/spark/home')
from pyspark import SparkContext
Spark home on an EMR node should be something like /home/hadoop/spark. See https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 for more details.
You can execute the file directly from the command line with:
spark-submit FileName.py
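For reference, here is a minimal sketch of what such a FileName.py might look like (the script contents and the count_words helper are illustrative, not from the original question). When launched via spark-submit, pyspark is put on the Python path automatically, so no PYTHONPATH changes are needed:

def count_words(lines):
    # Pure-Python helper: count words across an iterable of lines.
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

if __name__ == "__main__":
    import sys
    # pyspark is importable here because spark-submit sets up the path.
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")
    rdd = sc.textFile(sys.argv[1])
    pairs = (rdd.flatMap(lambda line: line.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))
    print(pairs.take(10))
    sc.stop()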