
Under what circumstances must I use the --py-files option of spark-submit?

Just poking around spark-submit, I was under the impression that if my application has dependencies on other .py files then I have to distribute them using the --py-files option (see Bundling Your Application's Dependencies). I took that to mean any such file had to be declared with --py-files, yet the following works fine... two .py files:

spark_submit_test_lib.py :

def do_sum(sc):
  data = [1, 2, 3, 4, 5]
  distData = sc.parallelize(data)
  return distData.sum()

and spark_submit_test.py :

from pyspark import SparkContext, SparkConf
from spark_submit_test_lib import do_sum
conf = SparkConf().setAppName('JT_test')
sc = SparkContext(conf=conf)
print(do_sum(sc))

submitted using:

spark-submit --queue 'myqueue' spark_submit_test.py

All worked fine: the code ran, yielded the correct result, and spark-submit terminated gracefully.
However, having read the documentation, I would have thought I had to do this:

spark-submit --queue 'myqueue' --py-files spark_submit_test_lib.py spark_submit_test.py

That still worked of course. I'm just wondering why the former worked as well. Any suggestions?

You must be submitting this in a local environment, where your driver and executors run on the same machine; that is the reason it worked. But if you deploy to a cluster and run it from there, you have to use the --py-files option.
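As an illustration (assuming a YARN cluster here; the queue and file names are the ones from the question), a cluster-mode submission would have to ship the dependency explicitly:

spark-submit --queue 'myqueue' --master yarn --deploy-mode cluster --py-files spark_submit_test_lib.py spark_submit_test.py

With --deploy-mode cluster the driver is launched on a worker node that has no local copy of spark_submit_test_lib.py, so the import would fail unless the file is distributed with --py-files (or added programmatically via sc.addPyFile).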

Please check the link for more details
