
I can't seem to get --py-files on Spark to work

I'm having a problem with using Python on Spark. My application has some dependencies, such as numpy, pandas, astropy, etc. I cannot use virtualenv to create an environment with all the dependencies, since the nodes on the cluster do not share any common mountpoint or filesystem besides HDFS. Therefore I am stuck with spark-submit --py-files. I package the contents of site-packages in a ZIP file and submit the job with the --py-files=dependencies.zip option (as suggested in Easiest way to install Python dependencies on Spark executor nodes?). However, the nodes on the cluster still do not seem to see the modules inside, and they throw an ImportError such as this when importing numpy:

File "/path/anonymized/module.py", line 6, in <module>
    import numpy
File "/tmp/pip-build-4fjFLQ/numpy/numpy/__init__.py", line 180, in <module>   
File "/tmp/pip-build-4fjFLQ/numpy/numpy/add_newdocs.py", line 13, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/lib/__init__.py", line 8, in <module>
    #
File "/tmp/pip-build-4fjFLQ/numpy/numpy/lib/type_check.py", line 11, in <module>
File "/tmp/pip-build-4fjFLQ/numpy/numpy/core/__init__.py", line 14, in <module>
ImportError: cannot import name multiarray

When I switch to the virtualenv and use the local pyspark shell, everything works fine, so the dependencies are all there. Does anyone know what might cause this problem and how to fix it?

Thanks!

First off, I'll assume that your dependencies are listed in requirements.txt. To package and zip the dependencies, run the following at the command line:

pip install -t dependencies -r requirements.txt
cd dependencies
zip -r ../dependencies.zip .

Above, the cd dependencies command is crucial to ensure that the modules are at the top level of the zip file. Thanks to Dan Corin's post for the heads-up.

Next, submit the job via:

spark-submit --py-files dependencies.zip spark_job.py

The --py-files directive sends the zip file to the Spark workers but does not add it to the PYTHONPATH (a source of confusion for me). To add the dependencies to the PYTHONPATH and fix the ImportError, add the following line to the Spark job, spark_job.py:

sc.addPyFile("dependencies.zip")
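For context, here is a minimal sketch of how that line might sit inside spark_job.py; the app name and the numpy usage are illustrative additions, not part of the original answer (and note the caveats further down about compiled extensions inside zip archives):

# spark_job.py -- minimal sketch; app name and numpy usage are illustrative
from pyspark import SparkContext

sc = SparkContext(appName="py_files_demo")

# make the zipped site-packages importable on every executor
sc.addPyFile("dependencies.zip")

def double(x):
    # importing inside the function means the executors resolve numpy
    # from dependencies.zip (see the caveats below about compiled extensions)
    import numpy
    return int(numpy.multiply(x, 2))

print(sc.parallelize([1, 2, 3]).map(double).collect())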

A caveat from this Cloudera post:

An assumption that anyone doing distributed computing with commodity hardware must assume is that the underlying hardware is potentially heterogeneous. A Python egg built on a client machine will be specific to the client's CPU architecture because of the required C compilation. Distributing an egg for a complex, compiled package like NumPy, SciPy, or pandas is a brittle solution that is likely to fail on most clusters, at least eventually.

Although the solution above does not build an egg, the same guideline applies.

  • First you need to pass your files through --py-files or --files

    • When you pass your zip/files with the above flags, basically your resources will be transferred to a temporary directory created on HDFS just for the lifetime of that application.
  • Now in your code, add those zip/files by using the following command:

    sc.addPyFile("your zip/file")

    • What the above does is load the files into the execution environment, like the JVM.
  • Now import your zip/file in your code with an alias like the following to start referencing it

    import your_file as your_alias

    Note: You need not use the file extension while importing, like .py at the end

Hope this is useful.

To get this dependency distribution approach to work with compiled extensions we need to do two things:

  1. Run the pip install on the same OS as your target cluster (preferably on the master node of the cluster). This ensures compatible binaries are included in your zip.
  2. Unzip your archive on the destination node. This is necessary since Python will not import compiled extensions from zip files. ( https://docs.python.org/3.8/library/zipimport.html )

Using the following script to create your dependencies zip will ensure that you are isolated from any packages already installed on your system. This assumes virtualenv is installed and requirements.txt is present in your current directory, and outputs a dependencies.zip with all your dependencies at the root level.

env_name=temp_env

# create the virtual env
virtualenv --python=$(which python3) --clear /tmp/${env_name}

# activate the virtual env
source /tmp/${env_name}/bin/activate

# download and install dependencies
pip install -r requirements.txt

# package the dependencies in dependencies.zip. the cd magic works around the fact that you can't specify a base dir to zip
(cd /tmp/${env_name}/lib/python*/site-packages/ && zip -r - *) > dependencies.zip

The dependencies can now be deployed, unzipped, and included in the PYTHONPATH as follows:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf 'spark.yarn.dist.archives=dependencies.zip#deps' \
  --conf 'spark.yarn.appMasterEnv.PYTHONPATH=deps' \
  --conf 'spark.executorEnv.PYTHONPATH=deps' \
.
.
.

spark.yarn.dist.archives=dependencies.zip#deps
distributes your zip file and unzips it to a directory called deps

spark.yarn.appMasterEnv.PYTHONPATH=deps
spark.executorEnv.PYTHONPATH=deps
includes the deps directory in the PYTHONPATH for the master and all workers

--deploy-mode cluster
runs the master executor on the cluster so it picks up the dependencies
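
For illustration, here is a minimal sketch of what the submitted job script might look like under this setup; the file name, app name, and the numpy call are placeholders, and it assumes the configuration above has put the unzipped deps directory on the PYTHONPATH of both the application master and the executors:

# job.py -- minimal sketch; assumes the submit command above, so the unzipped
# 'deps' directory is already on PYTHONPATH for the driver and the executors
from pyspark.sql import SparkSession

import numpy  # resolved from deps/ rather than a system-wide install

spark = SparkSession.builder.appName("deps_demo").getOrCreate()

rdd = spark.sparkContext.parallelize(range(4))
print(rdd.map(lambda x: float(numpy.sqrt(x))).collect())

spark.stop()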

You can locate all the .py files you need and add them to sys.path relative to your script. See here for this explanation:

import os, sys, inspect
# realpath() will make your script run, even if you symlink it :)
cmd_folder = os.path.realpath(os.path.abspath(os.path.split(inspect.getfile( inspect.currentframe() ))[0]))
if cmd_folder not in sys.path:
    sys.path.insert(0, cmd_folder)

# use this if you want to include modules from a subfolder
cmd_subfolder = os.path.realpath(os.path.abspath(os.path.join(os.path.split(inspect.getfile( inspect.currentframe() ))[0],"subfolder")))
if cmd_subfolder not in sys.path:
    sys.path.insert(0, cmd_subfolder)

# Info:
# cmd_folder = os.path.dirname(os.path.abspath(__file__)) # DO NOT USE __file__ !!!
# __file__ fails if script is called in different ways on Windows
# __file__ fails if someone does os.chdir() before
# sys.argv[0] also fails because it doesn't always contain the path

Spark will also silently fail to load a zip archive that is created with the Python zipfile module. Zip archives must be created using a zip utility.

Try using --archives to ship your anaconda dir to each server, and use --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON= to tell Spark where the Python executable is inside your anaconda dir.

Our full config is this:

--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./ANACONDA/anaconda-dependencies/bin/python 

--archives <S3-path>/anaconda-dependencies.zip#ANACONDA

As Andrej Palicka explained in the comments,

"the problem lies in the fact, that Python cannot import .so modules from .zip files (docs.python.org/2/library/zipimport.html)". “问题在于,Python 无法从 .zip 文件(docs.python.org/2/library/zipimport.html)导入 .so 模块”。

A solution that I found is to add the non-.py files one by one to --py-files, separated by commas:

spark-submit --py-files modules/toolbox.cpython-38-x86_64-linux-gnu.so,modules/product.cpython-38-x86_64-linux-gnu.so spark_from_cython.py
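
For completeness, a minimal sketch of what spark_from_cython.py might look like; the module names toolbox and product come from the file names above, but some_function is a hypothetical name used only for illustration:

# spark_from_cython.py -- minimal sketch; some_function is hypothetical
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cython_so_demo").getOrCreate()

def run_toolbox(x):
    # the .so files passed via --py-files are shipped to each worker and
    # placed on its Python path, so they import by their plain module names
    import toolbox
    return toolbox.some_function(x)  # hypothetical function exposed by toolbox

print(spark.sparkContext.parallelize([1, 2, 3]).map(run_toolbox).collect())
spark.stop()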
