
Submitting pyspark app inside zip file on AWS EMR

We have a pyspark-jobs repository whose build process pushes zip artifacts to S3. Let's say one such job is find-homes.zip, whose contents are shown below:

find-homes.zip
+-find_homes
  +- __init__.py
  +- run.py
+-helpers
  +- __init__.py
  +- helper_mod.py

I need to execute run.py (which has dependencies on helpers) inside the zip as the main entry point. I'm running the job in client mode, and the command I tried was spark-submit --py-files find-homes.zip find_homes.run.py. The find_homes.run.py file is a thin wrapper containing the following code:

import os
import importlib

def main():
    # Derive the module name from this wrapper's own filename
    filename = os.path.basename(__file__)
    module = os.path.splitext(filename)[0]
    # Import the real entry point (expected to come from the zip) and run it
    module = importlib.import_module(module)
    module.main()


if __name__ == '__main__':
    main()
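It can help to trace, in isolation, what module name the wrapper actually derives. Note that splitext strips only the last extension, so the dotted filename produces a dotted module path, which is why the import needs a find_homes package on the path:

```python
import os

# Same derivation as the wrapper, with the submitted file's path hardcoded
filename = os.path.basename("/home/hadoop/find_homes.run.py")
module = os.path.splitext(filename)[0]
print(module)  # find_homes.run
```

So importlib.import_module is asked for find_homes.run, and the failure to locate the find_homes package is exactly what the traceback below reports.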

I'm basically following the suggestion from this SO thread, but nothing is working. The error shown after launching the job is:

Traceback (most recent call last):
  File "/home/hadoop/find_homes.run.py", line 13, in <module>
    main()
  File "/home/hadoop/find_homes.run.py", line 8, in main
    module = importlib.import_module(module)
  File "/usr/lib64/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 941, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'find_homes'

I'm losing patience trying to figure out what I'm missing here. None of the suggestions (including updating PYTHONPATH with the zip location) works, so any help, or even a nudge in the right direction, is very much appreciated. I'm using EMR v5.23.0 with Spark 2.4.
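Since Python itself can import straight from a zip archive, the archive layout can be sanity-checked locally, without a cluster, by putting the zip on sys.path and attempting the same import the wrapper performs. A debugging sketch (the archive path assumes the artifact sits in the current directory):

```python
import importlib
import sys

# Python's zipimport lets a zip file act as a sys.path entry, which is
# essentially what --py-files relies on. If this import fails locally,
# the packages are not at the root of the archive.
sys.path.insert(0, "find-homes.zip")

try:
    importlib.import_module("find_homes.run")
    print("zip layout is importable")
except ModuleNotFoundError as e:
    print("bad zip layout:", e)
```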

Update

Well, something strange happened. I was using the following Gradle task to generate the zip:

task build(type: Zip) {
    from ('src/')
    into "."
    archiveName = "${project.name}.zip"
    includeEmptyDirs = false
    destinationDir = new File(projectDir, 'build/distributions')
}

I don't know how it occurred to me, but I simply unzipped my artifact, zipped it again using zip -r find_homes.zip <packages>, and then used the resulting zip with spark-submit, and it worked. I have no idea why, as the folder structures are exactly the same in both cases.
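One way to see how two "identical-looking" archives actually differ is to compare their entry lists rather than the extracted folders: archive tools can disagree on things like explicit directory entries. A small diagnostic sketch (the archive paths are illustrative):

```python
import zipfile

def entries(path):
    """Sorted archive entry names; directory entries end with '/'."""
    with zipfile.ZipFile(path) as z:
        return sorted(z.namelist())

def diff_archives(a, b):
    """Entry names present in only one of the two archives."""
    return set(entries(a)) ^ set(entries(b))

# e.g. diff_archives("build/distributions/find-homes.zip", "find_homes.zip")
```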

For those using EMR for Spark jobs, I'm sharing my findings and the route I took after trying out different approaches. The key points are listed below.

  1. Manage Python dependencies through an EMR bootstrap script. All the Python packages your job depends on (e.g. pandas, sklearn) need to be installed on the executors. This can be done through a bootstrap script at the time the cluster is launched.

  2. Assuming you have a Gradle project for Python (perhaps alongside other languages like Java), pygradle doesn't seem to add much value if point #1 is taken care of.

  3. The built-in Gradle zip task likely won't work for creating a zip file of your Python modules. I added a zip-creation module written in Python and invoked it from a Gradle task via command-line execution. So the Gradle task calls the Python script with appropriate arguments to generate the zip file. Make sure your packages are present at the root level of the zip. Then follow the link I shared in the question above to submit your pyspark job.
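A minimal sketch of such a zip-creation helper, assuming the packages live under a src/ directory (the function and path names here are illustrative, not the exact script from the project):

```python
import os
import zipfile

def build_zip(src_dir, out_path):
    """Zip every file under src_dir so that packages sit at the archive root."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # arcname relative to src_dir puts find_homes/... at the root,
                # which is what --py-files needs to make the package importable
                zf.write(full, arcname=os.path.relpath(full, src_dir))

# e.g. build_zip("src", "build/distributions/find-homes.zip")
```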
