Azure Synapse Notebook code to retrieve spark pool tags

When running a PySpark notebook interactively or in a pipeline, how do you retrieve the Spark pool tags? Please provide a code example. Thanks.

The answer to this question is not simple, since Spark is open-source code while Azure object tags belong to the Azure management web services.

I will walk you through my thought process and how I solved this problem.

First, the Spark session contains the name of the Synapse cluster (pool) that the notebook is running under. The following code retrieves this name.

%%pyspark

#
# Get spark pool name
#

# Import library
from pyspark.context import SparkContext

# Create context
sc = SparkContext.getOrCreate()

# Get configuration
tuples = sc.getConf().getAll()

# Find spark pool name
for element in tuples:
    if element[0].find('spark.synapse.pool.name') != -1:
        print(element[0])
        print(element[1])
        print("")

Here is the output from the execution.

[Screenshot: the spark.synapse.pool.name key and its value printed by the cell above.]
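
As a side note, if you already know the exact configuration key, you can read it directly instead of scanning all the settings. This is a minimal sketch; the default value guards against the key being absent on your pool.

%%pyspark

#
# Get spark pool name (direct lookup)
#

# Import library
from pyspark.context import SparkContext

# Create context
sc = SparkContext.getOrCreate()

# Read the key directly; fall back if it is not set
print(sc.getConf().get('spark.synapse.pool.name', 'not found'))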

The next task is to add a tag to the existing Spark cluster. My tag key is "stack_overflow_question" and its value is "yes". This is the key-value pair.

[Screenshot: the stack_overflow_question tag added to the Spark pool.]

Since the Spark context does not contain this tag information, we have to turn to the Azure tooling to get it.

The Azure Command Line Interface (CLI) is one step up from a raw REST API call. I am going to do a quick test to make sure the list command returns the information that I want.

[Screenshot: Azure CLI output listing the Spark pool, including its tags.]
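
For reference, the quick test looked something like the following. This is a sketch: it assumes the resource names used later in this post and an Azure CLI with the synapse commands available.

az synapse spark pool list \
    --workspace-name wsn4synapse \
    --resource-group rg4synapse \
    --query "[].{name:name, tags:tags}"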

We can see from this test that a REST API call will return the tag information.

1 - We need to create a service principal that has access to Microsoft Graph with user read privileges. I am adding two MSDN links that show how to accomplish this task.

https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal

https://docs.microsoft.com/en-us/graph/migrate-azure-ad-graph-configure-permissions?tabs=powershell

2 - We need to write code to log in to Azure using the service principal and return an access token (bearer token).

%%pyspark

#
# 2 - Get access token
#

# Import library
import adal

# Key information (parameters)
tenant_id = 'your tenant id'
client_id = 'your client id'
client_secret = 'your client secret'
subscription_id = 'your subscription id'

# Microsoft login url
authority_url = 'https://login.microsoftonline.com/' + tenant_id
context = adal.AuthenticationContext(authority_url)

# Ask for access token
token = context.acquire_token_with_client_credentials(
    resource = 'https://management.azure.com/',
    client_id = client_id,
    client_secret = client_secret
)

# Show token
print(token["accessToken"])

If everything works correctly, you should get a large string of characters back. I am only showing a portion of it to prove that it worked.

[Screenshot: the first portion of the returned access token.]
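
As an aside, the adal library is deprecated in favor of MSAL. A roughly equivalent sketch using MSAL, assuming the same service principal values as above, would be:

%%pyspark

#
# 2b - Get access token with MSAL (optional alternative)
#

# Import library
import msal

# Create a confidential client for the service principal
app = msal.ConfidentialClientApplication(
    client_id,
    authority='https://login.microsoftonline.com/' + tenant_id,
    client_credential=client_secret
)

# The .default scope maps to the ARM resource used above
result = app.acquire_token_for_client(scopes=['https://management.azure.com/.default'])

# Show token (note the key name differs from adal's 'accessToken')
print(result['access_token'])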

3 - The last step is to make a REST API call to return the information that we want. The code below does just that. I am including the MSDN reference for the API.

https://docs.microsoft.com/en-us/rest/api/synapse/big-data-pools

%%pyspark

#
# 3 - List pool properties
#

# libraries
import requests
import json

# azure object info
sub_id = "your subscription id"
rg_name = "rg4synapse"
ws_name = "wsn4synapse"
sp_name = "asp4synapse"

# management url
url = ""
url += "https://management.azure.com/subscriptions/{}/".format(sub_id)
url += "resourceGroups/{}/providers/Microsoft.Synapse/".format(rg_name)
url += "workspaces/{}/".format(ws_name)
url += "bigDataPools/{}".format(sp_name)

# access token + api version
headers = {'Authorization': 'Bearer ' + token['accessToken'], 'Content-Type': 'application/json'}
params = {'api-version': '2021-06-01'}

# make rest api call
r = requests.get(url, headers=headers, params=params)

# show the results
print(json.dumps(r.json(), indent=4, separators=(',', ': ')))

I chose to place the resulting JSON document in this post as code, since it is a lot easier to see the whole string that way.

{
    "properties": {
        "creationDate": "2021-09-13T19:46:27.95Z",
        "sparkVersion": "2.4",
        "nodeCount": 3,
        "nodeSize": "Small",
        "nodeSizeFamily": "MemoryOptimized",
        "autoScale": {
            "enabled": false,
            "minNodeCount": 3,
            "maxNodeCount": 3
        },
        "autoPause": {
            "enabled": true,
            "delayInMinutes": 15
        },
        "isComputeIsolationEnabled": false,
        "sessionLevelPackagesEnabled": true,
        "cacheSize": 0,
        "dynamicExecutorAllocation": {
            "enabled": false
        },
        "lastSucceededTimestamp": "2022-09-04T18:35:54.55Z",
        "isAutotuneEnabled": false,
        "provisioningState": "Succeeded"
    },
    "id": "/subscriptions/792f5db5-2798-4365-ba7b-e5812052a8d0/resourceGroups/rg4synapse/providers/Microsoft.Synapse/workspaces/wsn4synapse/bigDataPools/asp4synapse",
    "name": "asp4synapse",
    "type": "Microsoft.Synapse/workspaces/bigDataPools",
    "location": "eastus2",
    "tags": {
        "spark_overflow_question": "yes"
    }
}

Let's review the steps to make this happen.

1 - Use the Spark session to identify which cluster (pool) the notebook is running under.

2 - Have a service principal defined with access to read Microsoft Graph.

3 - Log in to Azure using the service principal to grab an access token.

4 - Make the REST API call with the access token and cluster name to return the tag properties.

In short, this solves your problem.

I leave parsing the JSON document to you. As a hint, look at this link.

https://www.geeksforgeeks.org/json-loads-in-python/
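
For example, pulling just the tags out of the response from step 3 could look like this. This is a minimal sketch that reuses the r object from the cell above.

%%pyspark

#
# 4 - Parse the tags from the response
#

# Convert the response body to a dictionary
doc = r.json()

# Tags live at the top level of the document
tags = doc.get('tags', {})

# Show each key-value pair
for key, value in tags.items():
    print('{} = {}'.format(key, value))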
