
How to create a Spot instance job cluster using an Azure Data Factory (ADF) linked service

I have an ADF pipeline with a Databricks activity.

The activity creates a new job cluster every time it runs, and I have added all the required Spark configurations to the corresponding linked service.

Now with Databricks offering Spot Instances, I'd like to create my new clusters with Spot configurations within Databricks.

I tried to find help in the linked service docs, but no luck!

How can I do this using ADF?

Cheers!!!

I have found another workaround to make the ADF Databricks linked service create job clusters with spot instances. As Alex Ott mentioned, the azure_attributes cluster property isn't supported by the Databricks linked service interface.

Instead, I ended up creating a cluster policy that enforces spot instances:

{
  "azure_attributes.availability": {
    "type": "fixed",
    "value": "SPOT_WITH_FALLBACK_AZURE",
    "hidden": true
  }
}

You can extend that policy if you want to control other properties of the azure_attributes object (see the sketch below). Also, make sure you set the policy permissions for the appropriate groups/users.
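If you prefer to script this instead of clicking through the UI, here is a minimal Python sketch (not from the original answer) that registers the policy through the Cluster Policies REST API. The workspace URL, token, policy name, and the extra first_on_demand / spot_bid_max_price rules are illustrative assumptions:

import json
import requests

# Hypothetical placeholders - replace with your workspace URL and a personal access token.
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"

# The policy from above, extended with two example rules: keep the first node
# (the driver) on demand and cap the spot bid at the on-demand price.
policy_definition = {
    "azure_attributes.availability": {
        "type": "fixed",
        "value": "SPOT_WITH_FALLBACK_AZURE",
        "hidden": True,
    },
    "azure_attributes.first_on_demand": {
        "type": "fixed",
        "value": 1,
        "hidden": True,
    },
    "azure_attributes.spot_bid_max_price": {
        "type": "fixed",
        "value": -1,  # -1 means "up to the current on-demand price"
        "hidden": True,
    },
}

# Create the policy; the definition must be passed as a JSON string.
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"name": "adf-spot-jobs", "definition": json.dumps(policy_definition)},
)
resp.raise_for_status()
print(resp.json()["policy_id"])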

After creating the policy, you will need to retrieve its policy ID. I used a REST call to the 2.0/policies/clusters/list endpoint to get that value, for example:
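For reference, a minimal Python sketch of that list call (the workspace URL and token are placeholders you would substitute):

import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

# List all cluster policies and print name -> policy_id so the id of the
# spot-instance policy can be copied into the linked service JSON.
resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/policies/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
for policy in resp.json().get("policies", []):
    print(policy["name"], policy["policy_id"])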

From there, you can do what Alex Ott suggested: create the linked service using the dynamic JSON option and add a policyId property with the appropriate policy ID to the typeProperties object:

"typeProperties": {
  "domain": "Your Domain",
  "newClusterNodeType": "@linkedService().ClusterNodeType",
  "newClusterNumOfWorker": "@linkedService().NumWorkers",
  "newClusterVersion": "7.3.x-scala2.12",
  "newClusterInitScripts": [],
  "newClusterDriverNodeType": "@linkedService().DriverNodeType",
  "policyId": "Your policy id",
}

Now when you invoke your ADF pipeline it will create a job cluster using the cluster policy to restrict the availability property of azure_attributes to whatever you specified.

I'm not sure that it's possible right now, as it requires specifying the azure_attributes parameters when creating the cluster. But there should be a workaround: create an instance pool of spot instances and specify that pool via the instancePoolId property (a sketch of creating such a pool through the REST API follows).
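If you'd rather create that pool from code than from the UI (the UI route is shown in the update below), here is a minimal sketch using the Instance Pools REST API; the pool name, node type, sizes, and bid price are illustrative assumptions:

import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

# Create an instance pool backed by Azure spot VMs.
pool_spec = {
    "instance_pool_name": "adf-spot-pool",
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 0,
    "max_capacity": 10,
    "idle_instance_autotermination_minutes": 15,
    "azure_attributes": {
        "availability": "SPOT_AZURE",
        "spot_bid_max_price": -1,  # -1 means "up to the current on-demand price"
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=pool_spec,
)
resp.raise_for_status()
# This id is what goes into the linked service's instancePoolId property.
print(resp.json()["instance_pool_id"])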

Update: it really works. The only drawback is that you need to use JSON to configure the linked service (but it's possible to configure everything visually, save it, grab the JSON from the Git repository, and update it with the required parameters). The basic steps are as follows:

  • Configure instance pool to use spot instances:

[screenshot: instance pool configured to use spot instances]

  • Configure Databricks linked service to use the instance pool:
{
    "name": "DBName",
    "type": "Microsoft.DataFactory/factories/linkedservices",
    "properties": {
        "annotations": [],
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://some-url.azuredatabricks.net",
            "newClusterNodeType": "Standard_DS3_v2",
            "newClusterNumOfWorker": "5",
            "instancePoolId": "<your-pool-id>",
            "newClusterSparkEnvVars": {
                "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
            },
            "newClusterVersion": "8.2.x-scala2.12",
            "newClusterInitScripts": [],
            "encryptedCredential": "some-base-64"
        }
    }
}
  • Configure an ADF pipeline with the job to execute, just as usual.

  • Trigger the ADF pipeline and, after several minutes, see that the instance pool is used (a programmatic check is sketched after the screenshot):

[screenshot: instance pool usage shown in the Databricks UI]
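If you want to verify this without the UI, here is a small sketch (assuming the stats object returned by the Instance Pools get endpoint; workspace URL, token, and pool id are placeholders) that prints the pool's usage counters:

import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder
POOL_ID = "<your-pool-id>"  # placeholder

# Fetch the pool; a non-zero used_count while the ADF-triggered job is running
# indicates the job cluster is drawing nodes from the spot pool.
resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/instance-pools/get",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"instance_pool_id": POOL_ID},
)
resp.raise_for_status()
print(resp.json().get("stats", {}))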

Please use the ADF linked service option shown below to create a Spot instance:

[screenshot: ADF linked service configuration option]
