
Submitting pyspark app inside zip file on AWS EMR

We have a pyspark-jobs repository whose build process pushes zip artifacts to S3. Let's say one such job is find-homes.zip, whose contents are shown below:

find-homes.zip
+-find_homes
  +- __init__.py
  +- run.py
+-helpers
  +- __init__.py
  +- helper_mod.py

I need to execute run.py (which depends on helpers) inside the zip as the main program. I'm running the job in client mode, and the command I tried was spark-submit --py-files find-homes.zip find_homes.run.py. The find_homes.run.py file is a thin wrapper containing the following code:

import os
import importlib


def main():
    # Derive the dotted module name from this wrapper's own file name:
    # "find_homes.run.py" -> "find_homes.run"
    filename = os.path.basename(__file__)
    module = os.path.splitext(filename)[0]
    # Import that module from the zip shipped via --py-files and hand
    # control to its main().
    module = importlib.import_module(module)
    module.main()


if __name__ == '__main__':
    main()
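
For context, the wrapper assumes that the real entry point inside the zip exposes a main() function. A minimal sketch of what find_homes/run.py is expected to look like (the job logic and the helper import are illustrative, based on the layout above, not the actual code):

# find_homes/run.py -- hypothetical sketch of the job entry point
from helpers import helper_mod  # helpers sits at the zip root, next to find_homes


def main():
    # ... build the SparkSession and run the actual job ...
    pass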

I'm basically following the suggestion from this SO thread, but nothing is working. The error shown after launching the job is:

Traceback (most recent call last):
  File "/home/hadoop/find_homes.run.py", line 13, in <module>
    main()
  File "/home/hadoop/find_homes.run.py", line 8, in main
    module = importlib.import_module(module)
  File "/usr/lib64/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 941, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'find_homes'

I'm running out of patience trying to find out what I'm missing here. None of the suggestions (including updating PYTHONPATH with the zip location) work, so any help, or even a nudge in the right direction, is much appreciated. I'm using EMR v5.23.0 with Spark 2.4.
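
For reference, the PYTHONPATH-style variation looks roughly like the sketch below: force the shipped zip onto sys.path inside the wrapper before importing. The zip location here is an assumption; with --py-files the archive is normally localized into the container's working directory.

import os
import sys
import importlib

# Hypothetical sketch of the "add the zip to the path" workaround.
# The exact location of the shipped zip is an assumption.
zip_path = os.path.join(os.getcwd(), 'find-homes.zip')
if os.path.exists(zip_path):
    sys.path.insert(0, zip_path)

module = importlib.import_module('find_homes.run')
module.main()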

Update

Well, something strange happened. I was using the following Gradle task to generate the zip:

task build(type: Zip) {
    from ('src/')
    into "."
    archiveName = "${project.name}.zip"
    includeEmptyDirs = false
    destinationDir = new File(projectDir, 'build/distributions')
}

I don't know how it occurred to me, but I simply unzipped my artifact, zipped it again using zip -r find_homes.zip <packages>, and then used the resulting zip with spark-submit, and it worked. I have no idea why, as the folder structures are exactly the same in both cases.
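
One plausible (unverified) explanation is that the two archives differ in their internal entry names rather than in the visible folder structure, for example entries stored with a leading ./ prefix by the Gradle task, which Python's zip importer cannot resolve. The entry names can be compared with a few lines of Python (the archive paths below are placeholders):

import zipfile

# Print the raw entry names of each archive. zipimport resolves packages
# by exact entry name, so "find_homes/run.py" is importable while
# "./find_homes/run.py" is not.
for path in ('gradle/find-homes.zip', 'rezipped/find_homes.zip'):  # placeholder paths
    with zipfile.ZipFile(path) as zf:
        print(path)
        for name in zf.namelist():
            print('  ', name)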

For those who are using EMR for Spark jobs, I'm sharing my findings and the route I took after trying out different approaches. The key points are listed below.

  1. Manage Python dependencies through an EMR bootstrap script. All the Python packages your job depends on (e.g. pandas, sklearn) need to be installed on the executors; this can be done through a bootstrap script at the time of launching the cluster.

  2. Assuming you have a Gradle project for Python (maybe along with other languages like Java), pygradle doesn't seem to add much value if point #1 is taken care of.

  3. The built-in Gradle Zip task is unlikely to produce a zip that works for this (see the update above). Instead, I added a zip-creation script in Python and invoked it from a Gradle task via command-line execution, so the Gradle task calls the Python script with the appropriate arguments to generate the zip file; a sketch follows this list. Make sure your packages are present at the root level of the zip. Then follow the link I shared in the question above to submit your pyspark job.
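
For illustration, here is a minimal sketch of such a zip-creation script, assuming the packages live under src/ as in the layout above (the script name and arguments are hypothetical, not my exact code):

#!/usr/bin/env python3
"""Build a pyspark artifact zip with the packages at the zip root.

Hypothetical invocation from a Gradle Exec task:
    python make_zip.py src build/distributions/find_homes.zip
"""
import os
import sys
import zipfile


def make_zip(src_dir, out_path):
    with zipfile.ZipFile(out_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store entries relative to src_dir so that packages like
                # find_homes/ and helpers/ sit at the zip root, with no
                # leading "./" in the entry names.
                zf.write(full, os.path.relpath(full, src_dir))


if __name__ == '__main__':
    make_zip(sys.argv[1], sys.argv[2])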
