Submitting a Pyspark job with multiple files in AWS EMR

I have a PySpark job that is split across multiple code files in this structure:

flexible_calendar
   - Cache
       - redis_main.py
   - Helpers
       - helpers.py
   - Spark
       - spark_main.py
   - main.py

In 'main.py' I use functions from 'helpers.py', 'redis_main.py', etc.

The 'flexible_calendar' folder is uploaded to an S3 bucket so that EMR can run the code from it.

I've created an EMR cluster that is bootstrapped with all the needed packages, and it works when I run a simple single-file script (from S3) that contains all the functions:

[image]

The problem is that when I use the multi-file structure, the code fails because it doesn't recognize the modules 'helpers.py', 'spark_main.py', etc., like so:

[image]

I've tried multiple configurations in the 'Step Arguments' field, none of which worked, such as:

Arguments: spark-submit --deploy-mode cluster s3://flexible-calendar/flexible-calendar-emr
Arguments: spark-submit --deploy-mode cluster s3://flexible-calendar/flexible-calendar-emr/Cache/redis_main.py s3://flexible-calendar/flexible-calendar-emr/Helpers/helpers.py s3://flexible-calendar/flexible-calendar-emr/Spark/spark_main.py s3://flexible-calendar/flexible-calendar-emr/main.py
Arguments: spark-submit --deploy-mode cluster --class s3://flexible-calendar/flexible-calendar-emr s3://flexible-calendar/flexible-calendar-emr/main.py
Arguments: spark-submit --deploy-mode cluster --class s3://flexible-calendar/main_one.py

Also:

Arguments: spark-submit --py-files s3://flexible-calendar/flexible-calendar-emr.zip
Arguments: spark-submit --deploy-mode --py-files s3://flexible-calendar/flexible-calendar-emr.zip
Arguments: spark-submit --py-files s3://flexible-calendar/flexible-calendar-emr.zip --deploy-mode cluster s3://flexible-calendar/flexible-calendar-emr/Spark/spark_main.py
Arguments: spark-submit --deploy-mode cluster s3://flexible-calendar/flexible-calendar-emr/Spark/spark_main.py --py-files s3://flexible-calendar/flexible-calendar-emr.zip

and more...

Hope someone could help,

Thanks.

Quoting from the Spark documentation:

For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.
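To make the ordering of the arguments concrete, here is a minimal sketch that assembles a 'Step Arguments' string. The helper name build_step_args is hypothetical; the two S3 paths are the ones from the question. It assumes main.py is the entry point and that the zip holds the Cache, Helpers and Spark folders at its root. The point it illustrates is that spark-submit options such as --py-files go before the application file, because everything after the application file is passed to the script as its own arguments.

```python
def build_step_args(main_py, py_files_zip, app_args=()):
    """Return spark-submit arguments with every option placed before the main script.

    Hypothetical helper for illustration only; it just builds a list of strings.
    """
    return [
        "spark-submit",
        "--deploy-mode", "cluster",
        "--py-files", py_files_zip,
        main_py,        # first non-option argument: the application file
        *app_args,      # anything after it is handed to main.py itself
    ]


# Paths taken from the question; adjust them to the real bucket layout.
args = build_step_args(
    main_py="s3://flexible-calendar/flexible-calendar-emr/main.py",
    py_files_zip="s3://flexible-calendar/flexible-calendar-emr.zip",
)
step_arguments = " ".join(args)
print(step_arguments)
```

Compared with the attempts in the question, this keeps --py-files ahead of the script path and points at main.py rather than spark_main.py. Whether that alone is enough depends on how the zip was built.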

So the zip file you have created needs to be added to the sys path inside the main.py.

import os
import sys
if os.path.exists('flexible-calendar-emr.zip'):  # zip shipped via --py-files
    sys.path.insert(0, 'flexible-calendar-emr.zip')
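For this to work, the paths inside the zip have to match the import statements in main.py. The following is a small local sketch, not EMR-specific, of one way to build such an archive and check that a module inside it can be imported once the zip is on sys.path. The function build_zip, the greet function and the temporary project are all made up for the demonstration, and the empty __init__.py is an assumption added to make Helpers a regular package.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_zip(src_dir, zip_path):
    """Zip every .py file under src_dir, keeping paths relative to src_dir."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.join(root, name)
                    # Relative arcname puts Helpers/ at the root of the archive.
                    zf.write(full, os.path.relpath(full, src_dir))
    return zip_path


# Throwaway project that mimics the Helpers folder from the question.
work = tempfile.mkdtemp()
src = os.path.join(work, "flexible_calendar")
os.makedirs(os.path.join(src, "Helpers"))
# Assumption: an empty __init__.py so that Helpers is a regular package.
open(os.path.join(src, "Helpers", "__init__.py"), "w").close()
with open(os.path.join(src, "Helpers", "helpers.py"), "w") as f:
    f.write("def greet():\n    return 'hello from helpers'\n")

archive = build_zip(src, os.path.join(work, "flexible-calendar-emr.zip"))

# Same idea as the snippet above: put the archive at the front of sys.path.
sys.path.insert(0, archive)
importlib.invalidate_caches()
helpers = importlib.import_module("Helpers.helpers")
result = helpers.greet()
print(result)
```

If the archive instead wraps everything in an extra top-level folder, the import would need that folder name as a prefix, so it is worth listing the zip contents before uploading it to S3.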

Let me know if this helps!
