
Submitting a PySpark job with a zip file

I have four Python files. One of them contains the Spark entry point; it drives the job and calls the other Python files. For now I pass the files to spark-submit with the --py-files option, but instead of submitting them this way I want to pack these four Python files into a zip file and submit that with spark-submit.

I searched a bit and learned that I can create a zip file, say myfile.zip, pack the Python files into it, and submit the Spark job with --py-files myfile.zip. But in multiple places I have also seen that I need to add one line of code, sc.addFile(zipfilepath), to the main Python file. I do not want to hard-code any path in the program, which is why I currently submit the job listing the individual files instead of a zip file.

My question is: is it required to call sc.addFile() with the path of the zip file, or will submitting the job with --py-files myfile.zip work on its own?

The spark-submit command I am using now:

'Args': ['spark-submit',
         '--deploy-mode', 'cluster', '--master', 'yarn',
         '--executor-memory', conf['emr_step_executor_memory'],
         '--executor-cores', conf['emr_step_executor_cores'],
         '--conf', 'spark.yarn.submit.waitAppCompletion=true',
         '--conf', 'spark.rpc.message.maxSize=1024',
         # --py-files takes a single comma-separated value with no spaces
         '--py-files',
         f'{s3_path}/file2.py,{s3_path}/file3.py,{s3_path}/file4.py',
         f'{s3_path}/mainfile.py'
         ]

The spark-submit command with the zip file:

'Args': ['spark-submit',
         '--deploy-mode', 'cluster', '--master', 'yarn',
         '--executor-memory', conf['emr_step_executor_memory'],
         '--executor-cores', conf['emr_step_executor_cores'],
         '--conf', 'spark.yarn.submit.waitAppCompletion=true',
         '--conf', 'spark.rpc.message.maxSize=1024',
         '--py-files',
         f'{s3_path}/myzipfile.zip',
         f'{s3_path}/mainfile.py'
         ]

Would the above spark-submit command work if I do not call sc.addFile() in mainfile.py?

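If you pass the zip file explicitly with --py-files, you don't have to add it again in the program. spark-submit distributes the files given to --py-files with the application and places them on the Python path, so mainfile.py can import the modules directly.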
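As a rough local illustration of why no extra call is needed: Python can import modules straight from a zip archive once that archive is on the import path, which is what --py-files arranges for the job. The sketch below imitates that step by hand, without Spark. The module name file2 comes from the question, but its content (a double function) is made up for the demo, and the modules are assumed to sit at the root of the archive.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_zip(zip_path, sources):
    """Write each module's source code to the root of a new zip archive."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for module_name, code in sources.items():
            zf.writestr(f"{module_name}.py", code)
    return zip_path


# Hypothetical stand-in for file2.py; the real file's content is unknown.
workdir = tempfile.mkdtemp()
zip_path = build_zip(
    os.path.join(workdir, "myzipfile.zip"),
    {"file2": "def double(x):\n    return 2 * x\n"},
)

# Done by hand here to imitate the effect of --py-files:
# the archive itself becomes an entry on the import path.
sys.path.insert(0, zip_path)
importlib.invalidate_caches()

file2 = importlib.import_module("file2")
result = file2.double(21)
print(result)
```

If the files were zipped inside a subfolder instead of at the archive root, a plain import file2 would not resolve, so it is worth checking the archive layout before submitting.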

Please refer to https://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies
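For completeness, here is a minimal sketch of packing the helper files and assembling the argument list for the zip variant. The helper names zip_python_files and build_step_args and the bucket s3://example-bucket/code are hypothetical, the placeholder modules are created only so the demo runs, and uploading the archive to S3 is left out.

```python
import os
import tempfile
import zipfile


def zip_python_files(paths, zip_path):
    """Pack the given .py files at the root of the archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            # arcname drops the directory so `import file2` resolves.
            zf.write(path, arcname=os.path.basename(path))
    return zip_path


def build_step_args(s3_path, zip_name, main_name):
    """Build the spark-submit argument list for the zip variant."""
    return [
        "spark-submit",
        "--deploy-mode", "cluster",
        "--master", "yarn",
        "--py-files", f"{s3_path}/{zip_name}",
        f"{s3_path}/{main_name}",
    ]


# Placeholder modules standing in for file2.py, file3.py and file4.py.
workdir = tempfile.mkdtemp()
module_paths = []
for name in ("file2.py", "file3.py", "file4.py"):
    path = os.path.join(workdir, name)
    with open(path, "w") as fh:
        fh.write("# placeholder module\n")
    module_paths.append(path)

archive = zip_python_files(module_paths, os.path.join(workdir, "myzipfile.zip"))
with zipfile.ZipFile(archive) as zf:
    archived_names = sorted(zf.namelist())

# Hypothetical S3 prefix, used only to show the resulting arguments.
step_args = build_step_args("s3://example-bucket/code", "myzipfile.zip", "mainfile.py")
print(archived_names)
print(step_args)
```

The main script stays outside the archive and is passed as the last argument, as in the question's command.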
