
Cluster Resource Usage in Databricks

I was just wondering if anyone could explain if all compute resources in a Databricks cluster are shared or if the resources are tied to each worker. For example, if two users were connected to a cluster made up of 2 workers with 4 cores per worker and one user's job required 2 cores and the other's required 6 cores, would they be able to share the 8 total cores or would the full 4 cores from one worker be unavailable during the job that only required 2 cores?

TL;DR: Yes, the default behavior allows sharing, but you're going to have to tightly control the default parallelism with such a small cluster.

Take a look at Job Scheduling for Apache Spark. I'm assuming you are using an "all-purpose" / "interactive" cluster where users are working in notebooks, or you are submitting jobs to an existing all-purpose cluster, and it is NOT a job cluster with multiple Spark applications being deployed.

Databricks Runs in FAIR Scheduling Mode by Default

Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
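If you want to confirm which mode your cluster is actually using, a minimal check from a notebook might look like this (assuming the usual `spark` session object that Databricks notebooks provide):

```python
# Read the scheduler mode from the cluster's Spark conf.
# Databricks all-purpose clusters set this to FAIR; plain Apache Spark
# falls back to FIFO when the key isn't set.
mode = spark.sparkContext.getConf().get("spark.scheduler.mode", "FIFO")
print(f"spark.scheduler.mode = {mode}")
```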

Apache Spark Defaults to FIFO

By default, Spark's scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don't need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.

Keep in mind the word "job" is a specific Spark term: it is an action being taken that launches one or more stages and tasks. See What is the concept of application, job, stage and task in spark?
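As a toy illustration (hypothetical data, not your workload): the transformations below are lazy, and only the action at the end launches a "job", which the scheduler then splits into stages and parallel tasks.

```python
# Hypothetical example: transformations are lazy; the action launches the job.
df = spark.range(0, 1_000_000)                              # single-column DataFrame of ids
buckets = df.groupBy((df.id % 10).alias("bucket")).count()  # adds a shuffle stage
buckets.collect()                                           # ACTION: one job -> stages -> tasks
```

Each stage's tasks are what actually compete for the 8 cores in your cluster.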

So in your example you have...

  • 2 Workers with 4 cores each == 8 cores == 8 tasks can be handled in parallel
  • One application (App A) that has a job that launches a stage with only 2 tasks.
  • One application (App B) that has a job that launches a stage with 6 tasks.

In this case, YES, you will be able to share the resources of the cluster. However, the devil is in the default behaviors. If you're reading from many files, performing a join, aggregating, etc., you're going to run into the fact that Spark partitions your data into chunks that can be acted on in parallel (see configuration like spark.default.parallelism).
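As a rough way to see this on your own cluster, you can inspect the default parallelism and how many partitions, i.e. tasks per stage, a given DataFrame produces (the path below is a placeholder and the printed values are examples only):

```python
# Tasks per stage come from how the data is partitioned, not from how many
# cores the job "needs".
print(spark.sparkContext.defaultParallelism)           # usually == total cores, 8 here
print(spark.conf.get("spark.sql.shuffle.partitions"))  # 200 by default for joins/aggregations

df = spark.read.parquet("/path/to/some/table")         # hypothetical path
print(df.rdd.getNumPartitions())                       # one task per partition in the scan stage
```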

So, in a more realistic example, you're going to have...

  • 2 Workers with 4 cores each == 8 cores == 8 tasks can be handled in parallel
  • One application (App A) that has a job that launches a stage with 200 tasks (see the note after this list on where the 200 comes from).
  • One application (App B) that has a job that launches three stages with 8 tasks, 200 tasks, and 1 task, respectively.
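The 200-task stages above aren't arbitrary: joins and aggregations in Spark SQL shuffle into spark.sql.shuffle.partitions partitions, which defaults to 200. On an 8-core cluster that means 200 tasks queuing for 8 slots, which is why the TL;DR says to tightly control parallelism. A sketch of dialing it down (the value 16 is just an example):

```python
# Bring shuffle parallelism closer to the cluster's 8 cores.
spark.conf.set("spark.sql.shuffle.partitions", "16")  # example value; tune per workload

# Joins and aggregations in this session now shuffle into 16 partitions,
# i.e. 16 tasks per shuffle stage instead of 200.
```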

In a scenario like this, FIFO scheduling (the Apache Spark default) will result in one of these applications blocking the other, since the number of available task slots is completely overwhelmed by the number of tasks in just one stage.

In FAIR scheduling mode there will still be some waiting, since the number of cores is small, but some work will be done on each job because FAIR scheduling round-robins at the task level.

In Apache Spark, you can get tighter control by creating different scheduler pools and submitting jobs only to the pools where they have "isolated" resources. The "better" way of doing this is with Databricks job clusters, which have isolated compute dedicated to the application being run.
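For completeness, a minimal sketch of the pool approach (the pool name "pool_a" is made up; the pools themselves, with their weights and minShare, are defined in a fair-scheduler allocation XML referenced by spark.scheduler.allocation.file):

```python
# Route the jobs submitted from this thread/notebook into a named pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool_a")

# ... run App A's jobs here ...

# Clear the property to fall back to the default pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", None)
```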
