Using Jupyter Notebook with pyspark: error "No module named numpy"
As explained, I'm using pyspark in a Jupyter notebook, and I'm getting the errors below.
I have a tf-idf; I normalize it; then this last step creates a cosine-similarity matrix for the documents.
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
mat = IndexedRowMatrix(
    data.select("V2", "norm")
        # only V2 and norm were selected, so the row index must come
        # from V2 (row.ID would raise an AttributeError here)
        .rdd.map(lambda row: IndexedRow(row.V2, row.norm.toArray()))).toBlockMatrix()
But this is the error I'm getting; at the bottom it says "No module named numpy":
2022-08-25 15:16:26,161 WARN scheduler.DAGScheduler: Broadcasting large task binary with size 4.0 MiB
2022-08-25 15:16:27,561 ERROR executor.Executor: Exception in task 0.0 in stage 8.0 (TID 15)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/3.2.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 601, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/usr/local/Cellar/apache-spark/3.2.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 71, in read_command
command = serializer._read_with_length(file)
File "/usr/local/Cellar/apache-spark/3.2.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
return self.loads(obj)
File "/usr/local/Cellar/apache-spark/3.2.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
return pickle.loads(obj, encoding=encoding)
File "<frozen zipimport>", line 259, in load_module
File "/usr/local/Cellar/apache-spark/3.2.1/libexec/python/lib/pyspark.zip/pyspark/mllib/__init__.py", line 26, in <module>
import numpy
ModuleNotFoundError: No module named 'numpy'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:555)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:713)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:695)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:508)
Oddly, numpy is installed correctly. So that's the issue: numpy is installed correctly, but pyspark can't find it while creating a cosine-similarity matrix in a Jupyter notebook.
Thank you for considering this.
If you are running Jupyter Notebook as an application within AWS EMR, try using a bootstrap script that installs the required version of numpy while provisioning the cluster.
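Outside EMR, the usual cause of this symptom is that the Spark executors launch a different Python interpreter than the notebook kernel, one whose site-packages lacks numpy. A minimal sketch of the standard workaround (an assumption about this setup, not part of the original post) is to point PySpark at the notebook's own interpreter before the SparkSession is created:

```python
import os
import sys

# PySpark starts worker processes with the interpreter named in
# PYSPARK_PYTHON. If that interpreter has no numpy installed, the
# `import numpy` inside pyspark.mllib fails on the executors even
# though the notebook kernel (the driver) imports numpy fine.
# Pointing both variables at the kernel's interpreter keeps the
# driver and workers on the same environment. These must be set
# before any SparkSession/SparkContext is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```

After restarting the kernel and running this cell first, rebuilding the SparkSession should let the executors resolve numpy from the same environment as the notebook.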