AWS EMR Spark "No Module named pyspark"

I created a Spark cluster, SSH'd into the master node, and launched the shell:

MASTER=yarn-client ./spark/bin/pyspark

When I do the following:

x = sc.textFile("s3://location/files.*")
xt = x.map(lambda x: handlejson(x))
table= sqlctx.inferSchema(xt)
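(`handlejson` is the asker's own function and isn't shown in the question; a minimal stand-in, purely hypothetical, would parse each line of the S3 files as a JSON record:)

```python
import json

def handlejson(line):
    """Hypothetical stand-in for the asker's parser: turn one
    JSON-encoded text line into a Python dict."""
    return json.loads(line)

# Example: parsing a single record
record = handlejson('{"user": "alice", "count": 3}')
```

The point is only that each worker must run Python code like this, which is why the workers need to import pyspark themselves.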

I get the following error:

Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/11/spark-assembly-1.1.0-hadoop2.4.0.jar
java.io.EOFException
        java.io.DataInputStream.readInt(DataInputStream.java:392)
        org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:151)
        org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:78)
        org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:54)
        org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:97)
        org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)

I also checked PYTHONPATH:

>>> os.environ['PYTHONPATH']
'/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip:/home/hadoop/spark/python/:/home/hadoop/spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar'
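For background on why a jar on PYTHONPATH can work at all: Python can import modules directly out of a zip archive (and a jar is a zip) listed on sys.path, via the zipimport machinery. A self-contained illustration of that mechanism (the archive and package names here are made up for the demo):

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny archive containing a package, mimicking pyspark
# living inside the spark-assembly jar.
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "fakejar.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("mypkg/__init__.py", "VALUE = 42\n")

# Same mechanism as putting the jar on PYTHONPATH.
sys.path.insert(0, archive)

import mypkg  # resolved via zipimport, not the filesystem

print(mypkg.VALUE)  # -> 42
```

So "the jar is on PYTHONPATH and contains pyspark/" is normally sufficient; the error in the question means the worker's Python could not pull the package out of the archive it was given.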

I also looked inside the jar for pyspark, and it's there:

jar -tf /home/hadoop/spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar | grep pyspark
pyspark/
pyspark/shuffle.py
pyspark/resultiterable.py
pyspark/files.py
pyspark/accumulators.py
pyspark/sql.py
pyspark/java_gateway.py
pyspark/join.py
pyspark/serializers.py
pyspark/shell.py
pyspark/rddsampler.py
pyspark/rdd.py
....

Has anyone run into this before? Thanks!

You'll want to reference these Spark issues:

The solution (assuming you would rather not rebuild your jar):

unzip -d foo spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar
cd foo
# if you don't have openjdk 1.6:
# yum install -y java-1.6.0-openjdk-devel.x86_64
/usr/lib/jvm/openjdk-1.6.0/bin/jar cvmf META-INF/MANIFEST.MF ../spark/lib/spark-assembly-1.1.0-hadoop2.4.0.jar .
# don't neglect the dot at the end of that command

This is fixed in later builds on EMR. See https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark for release notes and instructions.

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license. If you republish, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.
