
Amazon EMR Pyspark Module not found

I created an Amazon EMR cluster with Spark preinstalled. When I ssh into the cluster and run pyspark from the terminal, the PySpark shell starts normally.

I uploaded a file using scp, and when I try to run it with python FileName.py, I get an import error:

from pyspark import SparkContext
ImportError: No module named pyspark

How do I fix this?

For EMR 4.3, I added the following lines to ~/.bashrc:

export SPARK_HOME=/usr/lib/spark
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.XXX-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

Here py4j-0.XXX-src.zip is the py4j archive in your Spark Python library folder. Look in /usr/lib/spark/python/lib/ to find the exact file name and replace XXX with that version number.

Then run source ~/.bashrc (or open a new shell) so the variables take effect. After that, python FileName.py should be able to import pyspark.
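If you would rather not look up the py4j version by hand, the search can be scripted. The following is a minimal sketch, not part of the original answer: find_py4j_zip is a hypothetical helper name, and it assumes the archive follows the py4j-*-src.zip naming under $SPARK_HOME/python/lib described above.

```python
import glob
import os


def find_py4j_zip(spark_home="/usr/lib/spark"):
    """Return the path of the py4j source zip under spark_home, or None.

    Hypothetical helper: assumes the file is named py4j-<version>-src.zip
    and lives in <spark_home>/python/lib.
    """
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    # Normally there is exactly one match; if there are several, take the
    # last one in lexicographic order.
    return matches[-1] if matches else None


if __name__ == "__main__":
    # Prints the full path to use in the PYTHONPATH line, or None if
    # nothing matched.
    print(find_py4j_zip())
```

Running this on the master node prints the full path to use in the first PYTHONPATH line, or None if no archive matched.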

You probably need to add the PySpark files to Python's module search path. I typically use a function like the following.

import os
import sys

def configure_spark(spark_home=None, pyspark_python=None):
    # Fall back to a default location if no Spark home is given:
    spark_home = spark_home or "/path/to/default/spark/home"
    os.environ['SPARK_HOME'] = spark_home

    # Add the PySpark directories to the Python path:
    sys.path.insert(1, os.path.join(spark_home, 'python'))
    sys.path.insert(1, os.path.join(spark_home, 'python', 'pyspark'))
    sys.path.insert(1, os.path.join(spark_home, 'python', 'build'))

    # If PySpark isn't specified, use currently running Python binary:
    pyspark_python = pyspark_python or sys.executable
    os.environ['PYSPARK_PYTHON'] = pyspark_python

Then, you can call the function before importing pyspark:

configure_spark('/path/to/spark/home')
from pyspark import SparkContext

Spark home on an EMR node should be something like /home/hadoop/spark. See https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 for more details.
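Since the Spark home differs between setups (this page mentions both /usr/lib/spark and /home/hadoop/spark), one option is to probe for it rather than hard-code it. The sketch below is only an illustration: guess_spark_home is a hypothetical helper, the candidate list is taken from the paths mentioned on this page, and it assumes a Spark install contains a python/pyspark directory.

```python
import os


def guess_spark_home(candidates=("/usr/lib/spark", "/home/hadoop/spark")):
    """Return the first path that looks like a Spark install, else None.

    Hypothetical helper: an existing SPARK_HOME environment variable is
    tried first, then each candidate in order.
    """
    env_home = os.environ.get("SPARK_HOME")
    ordered = ([env_home] if env_home else []) + list(candidates)
    for path in ordered:
        # Assumption: a usable Spark home contains python/pyspark.
        if os.path.isdir(os.path.join(path, "python", "pyspark")):
            return path
    return None
```

The result can then be passed to the function above, for example configure_spark(guess_spark_home()). If nothing is found, None is passed and configure_spark falls back to its own default.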

Try using findspark. Install it from the shell with pip install findspark.

Sample code:

# Import package(s).
import findspark
findspark.init()

from pyspark import SparkContext
from pyspark.sql import SQLContext

You can run the file directly from the command line with the following command:

spark-submit FileName.py
