I am trying to read data from some containers into my notebook and load it into a Spark or pandas dataframe. There is some documentation about using an account password, but how can I do it with Azure Active Directory?
Unfortunately, Azure AD is not directly supported; these are the supported methods in Databricks for accessing Azure Blob Storage:
Reference: Databricks - Azure Blob Storage
Hope this helps.
There are several official Azure documents about accessing Azure Blob Storage using Azure AD, as below.
Meanwhile, here is my sample code that uses Azure AD (via a service principal) to retrieve the key (account password) of an Azure Storage account, so you can then use that key in Databricks.
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.storage import StorageManagementClient
import json

# Please refer to the second document above to get these parameter values
credentials = ServicePrincipalCredentials(
    client_id='<your client id>',
    secret='<your client secret>',
    tenant='<your tenant id>'
)
subscription_id = '<your subscription id>'
client = StorageManagementClient(credentials, subscription_id)

resource_group_name = '<the resource group name of your storage account>'
account_name = '<your storage account name>'

# print(dir(client.storage_accounts))
keys_json_text = client.storage_accounts.list_keys(
    resource_group_name, account_name, raw=True).response.text
keys_json = json.loads(keys_json_text)
# print(keys_json)
# {"keys":[{"keyName":"key1","value":"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx==","permissions":"FULL"},{"keyName":"key2","value":"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx==","permissions":"FULL"}]}
key1 = keys_json['keys'][0]['value']
print(key1)
Then, you can use the account key above to follow the official Azure Databricks document Data > Data Sources > Azure Blob Storage to read data.
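That document's approach comes down to two strings: a Spark configuration key that carries the account key, and a wasbs:// URL for the data. A minimal sketch of how they are built (the account, container, and path names here are placeholders, not values from your environment):

```python
def wasbs_conf_key(account_name: str) -> str:
    """Spark config key under which the storage account key is set."""
    return f"fs.azure.account.key.{account_name}.blob.core.windows.net"

def wasbs_url(container_name: str, account_name: str, path: str) -> str:
    """wasbs:// URL pointing at a blob inside a container."""
    return f"wasbs://{container_name}@{account_name}.blob.core.windows.net/{path}"

# On a Databricks cluster you would then do (requires a live SparkSession):
# spark.conf.set(wasbs_conf_key(account_name), key1)
# df = spark.read.json(wasbs_url(container_name, account_name, "path/to/data.json"))
```

The commented lines at the end show where `key1` from the code above plugs in.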
Otherwise, you can refer to Steps 1 & 2 of my answer to the other SO thread transform data in azure data factory using python data bricks to read data, as in the code below.
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import ContainerPermissions
from datetime import datetime, timedelta
import pandas as pd

account_name = '<your account name>'
account_key = '<your account key>'  # the key comes from the code above
container_name = '<your container name>'

service = BaseBlobService(account_name=account_name, account_key=account_key)
token = service.generate_container_shared_access_signature(
    container_name,
    permission=ContainerPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1))

blob_name = '<your blob name of dataset>'
blob_url_with_token = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{token}"

pdf = pd.read_json(blob_url_with_token)
df = spark.createDataFrame(pdf)
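The last two lines above are the whole pandas-to-Spark handoff: `pd.read_json` fetches and parses the blob (it accepts a URL directly), and `spark.createDataFrame` infers the Spark schema from the pandas dtypes. A self-contained sketch of the same handoff using inline JSON instead of a blob URL, so it runs without any storage account:

```python
import io
import pandas as pd

# Stand-in for the JSON content your blob would return
sample_json = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'
pdf = pd.read_json(io.StringIO(sample_json))

# On Databricks, hand the pandas frame to Spark (requires a SparkSession):
# df = spark.createDataFrame(pdf)
print(pdf.shape)  # (2, 2)
```

For large datasets, reading through pandas on the driver can be a bottleneck; the wasbs:// route lets Spark read the blob in parallel instead.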