
How to make Spark run all tasks in a job concurrently?

I have a system where a REST API (Flask) uses spark-submit to send jobs to an already-running PySpark application.

For various reasons, I need Spark to run all tasks at the same time (i.e. I need to set the number of executors equal to the number of tasks at runtime).

For example, if I have twenty tasks and only 4 cores, I want each core to execute 5 tasks (executors) without having to restart Spark.

I know I can set the number of executors when starting Spark, but I don't want to do that, since Spark is executing other jobs.

Is this possible to achieve through a workaround?

Use Spark scheduler pools. Here is an example of running multiple streaming queries in separate scheduler pools, taken from near the end of the Databricks documentation article (copied here for convenience); the same logic works for DStreams too: https://docs.databricks.com/spark/latest/structured-streaming/production.html

// Run streaming query1 in scheduler pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.writeStream.queryName("query1").format("parquet").start(path1)

// Run streaming query2 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query2").format("orc").start(path2)
