
Run a Spark job: python vs spark-submit

The common way of running a Spark job appears to be using spark-submit, as below (source):

spark-submit --py-files pyfile.py,zipfile.zip main.py --arg1 val1

Being newer to Spark, I wanted to know why this first method is preferred over running it from python (example):

python pyfile-that-uses-pyspark.py

The former method yields many more examples when googling the topic, but no explicitly stated reasons for it. In fact, here is another Stack Overflow question where one answer, repeated below, specifically tells the OP not to use the python method, but does not give a reason why.

don't run your py file as: python filename.py; instead use: spark-submit filename.py

Can someone provide insight?

@mint Your comment is more or less correct.

The spark-submit script in Spark's bin directory is used to launch applications on a cluster. It can use all of Spark's supported cluster managers through a uniform interface so you don't have to configure your application especially for each one.

As I understand it, you cannot launch an application on a cluster with python pyfile-that-uses-pyspark.py, or it is at least much more difficult to do so.
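To illustrate the difference, here is a minimal sketch of what a script like pyfile-that-uses-pyspark.py might look like when it is meant to be run with plain python. The file name, app name, and config values below are assumptions for the example, not something from the original question; the point is that everything spark-submit would normally pass in (master, resources, and so on) has to be handled inside the script itself.

# Sketch of a script run with plain `python` (assumed name:
# pyfile-that-uses-pyspark.py). Because spark-submit is not supplying
# a master or resources, the script has to configure them itself.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")                      # spark-submit would set this via --master
    .appName("run-with-plain-python")        # assumed app name
    .config("spark.executor.memory", "2g")   # spark-submit: --executor-memory 2g
    .getOrCreate()
)

df = spark.range(10)                         # trivial job just to exercise the session
print(df.count())
spark.stop()

With spark-submit, the same script could drop the .master() and .config() calls entirely and take those values from the command line, which is what makes the "uniform interface" across cluster managers possible.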

The slightly longer answer (beyond pointing out that the linked Anaconda docs are wrong, and that the official documentation never tells you to use python) is that Spark requires a JVM.

spark-submit is a wrapper around a JVM process: it sets up the classpath, downloads packages, verifies some configuration, among other things. Running python bypasses all of that, and it would all have to be rebuilt into pyspark/__init__.py so that those steps get run on import.
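As a rough sketch of what that pushes back onto your Python code: options spark-submit would normally take on the command line have to be injected before pyspark starts its JVM gateway, for example via the PYSPARK_SUBMIT_ARGS environment variable. That variable is a real pyspark mechanism, but the package coordinates and memory value below are made-up examples, and this is only one way to wire it up, not the officially recommended one.

import os

# Inject options that spark-submit would otherwise handle. pyspark reads
# PYSPARK_SUBMIT_ARGS when launching its JVM gateway; the value must end
# with "pyspark-shell". Package and memory values here are placeholders.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-avro_2.12:3.5.0 "
    "--driver-memory 2g "
    "pyspark-shell"
)

from pyspark.sql import SparkSession

# Creating the session is what actually launches the JVM; pyspark does it
# by shelling out to the spark-submit script under the hood.
spark = SparkSession.builder.appName("plain-python-launch").getOrCreate()
print(spark.sparkContext.master)
spark.stop()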
