
How to make spark run all tasks in a job concurrently?

I have a system where a REST API (Flask) uses spark-submit to send a job to an already-running PySpark instance.

For various reasons, I need Spark to run all tasks at the same time (i.e. I need to set the number of executors = the number of tasks at runtime).

For example, if I have twenty tasks and only 4 cores, I want each core to execute 5 tasks (executors) without having to restart Spark.

I know I can set the number of executors when starting Spark, but I don't want to do that since Spark is executing other jobs.

Is this possible to achieve through a workaround?

Use Spark scheduler pools. Here is an example of running multiple queries using scheduler pools (the relevant snippet, which is near the end of the article, is copied here for convenience); the same logic works for DStreams too: https://docs.databricks.com/spark/latest/structured-streaming/production.html

// Run streaming query1 in scheduler pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.writeStream.queryName("query1").format("parquet").start(path1)

// Run streaming query2 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query2").format("orc").start(path2)
