Using Jupyter Notebook with pyspark: error "No module named numpy"
As explained, I'm using pyspark in a Jupyter notebook, and I'm getting the errors below.
I have a tf-idf; I normalize it; then this last step creates a cosine-similarity matrix for the documents.
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
mat = IndexedRowMatrix(
    data.select("V2", "norm")
        # only V2 and norm were selected, so the row index must come
        # from V2 (row.ID would raise an AttributeError here)
        .rdd.map(lambda row: IndexedRow(row.V2, row.norm.toArray()))).toBlockMatrix()
But this is the error I'm getting; at the bottom it says "No module named numpy":
2022-08-25 15:16:26,161 WARN scheduler.DAGScheduler: Broadcasting large task binary with size 4.0 MiB
2022-08-25 15:16:27,561 ERROR executor.Executor: Exception in task 0.0 in stage 8.0 (TID 15)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/3.2.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 601, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/usr/local/Cellar/apache-spark/3.2.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 71, in read_command
command = serializer._read_with_length(file)
File "/usr/local/Cellar/apache-spark/3.2.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
return self.loads(obj)
File "/usr/local/Cellar/apache-spark/3.2.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
return pickle.loads(obj, encoding=encoding)
File "<frozen zipimport>", line 259, in load_module
File "/usr/local/Cellar/apache-spark/3.2.1/libexec/python/lib/pyspark.zip/pyspark/mllib/__init__.py", line 26, in <module>
import numpy
ModuleNotFoundError: No module named 'numpy'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:555)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:713)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:695)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:508)
Oddly, numpy is installed correctly. So that's the issue: numpy is installed correctly, but pyspark can't find it while creating a cosine-similarity matrix in a Jupyter notebook.
Thank you for considering this.
If you are running Jupyter Notebook as an application within AWS EMR, try using a bootstrap script that installs the required version of numpy while provisioning the cluster.
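Outside EMR, the usual cause of this symptom is that the Spark executors launch a different Python interpreter than the notebook kernel, one whose site-packages lacks numpy. A minimal sketch of the standard workaround (an assumption about this setup, not part of the original post) is to point PySpark at the notebook's own interpreter before the SparkSession is created:

```python
import os
import sys

# PySpark starts worker processes with the interpreter named in
# PYSPARK_PYTHON. If that interpreter has no numpy installed, the
# `import numpy` inside pyspark.mllib fails on the executors even
# though the notebook kernel (the driver) imports numpy fine.
# Pointing both variables at the kernel's interpreter keeps the
# driver and workers on the same environment. These must be set
# before any SparkSession/SparkContext is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```

After restarting the kernel and running this cell first, rebuilding the SparkSession should let the executors resolve numpy from the same environment as the notebook.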