How can I read an Azure Blob Storage file directly from an Azure Databricks Notebook
How to read data into a Databricks notebook from Azure Blob using Azure Active Directory (AAD)
I am trying to read data from some containers into my notebook and write it into a Spark or pandas dataframe. There is some documentation on using an account key, but how can I do this using Azure Active Directory?
Unfortunately, these are the only methods supported in Databricks for accessing Azure Blob Storage:
Reference: Databricks - Azure Blob Storage
Hope this helps.
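For reference, the account-key mount approach from that Databricks document can be sketched as below. This runs only inside a Databricks notebook (where `dbutils` and `spark` are predefined globals), and `<container>`, `<account>`, `<account key>`, and the mount point are placeholders for your own values:

```python
# Notebook-only sketch: dbutils and spark exist only inside Databricks.
# <container>, <account>, <account key> are placeholders.
dbutils.fs.mount(
    source="wasbs://<container>@<account>.blob.core.windows.net",
    mount_point="/mnt/blobdata",
    extra_configs={
        "fs.azure.account.key.<account>.blob.core.windows.net": "<account key>"
    },
)

# After mounting, blobs in the container are readable like DBFS paths.
df = spark.read.json("/mnt/blobdata/<your blob name>")
```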
There are several official Azure documents about accessing Azure Blob Storage with Azure AD, as listed below.
Meanwhile, here is my sample code for getting the key (account key) of an Azure Storage account, so that it can be used in Databricks.
```python
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.storage import StorageManagementClient
import json

# Please refer to the second document above to get these parameter values
credentials = ServicePrincipalCredentials(
    client_id='<your client id>',
    secret='<your client secret>',
    tenant='<your tenant id>'
)
subscription_id = '<your subscription id>'
client = StorageManagementClient(credentials, subscription_id)

resource_group_name = '<the resource group name of your storage account>'
account_name = '<your storage account name>'
# print(dir(client.storage_accounts))
keys_json_text = client.storage_accounts.list_keys(
    resource_group_name, account_name, raw=True).response.text

keys_json = json.loads(keys_json_text)
# print(keys_json)
# {"keys":[{"keyName":"key1","value":"xxxxxxxx==","permissions":"FULL"},
#          {"keyName":"key2","value":"xxxxxxxx==","permissions":"FULL"}]}
key1 = keys_json['keys'][0]['value']
print(key1)
```
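The `list_keys` response shown in the comment above has a simple JSON shape, so the extraction step can be tried stand-alone with a dummy response body (no real API call here, and the key values are made up):

```python
import json

# Dummy response text mirroring the shape printed in the comment above.
keys_json_text = (
    '{"keys":[{"keyName":"key1","value":"firstkey==","permissions":"FULL"},'
    '{"keyName":"key2","value":"secondkey==","permissions":"FULL"}]}'
)

# Map key names to values so either key1 or key2 can be looked up by name.
keys = {k["keyName"]: k["value"] for k in json.loads(keys_json_text)["keys"]}
print(keys["key1"])  # firstkey==
```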
Then, you can use the account key above to read data by following the official Azure Databricks documentation under Data > Data Sources > Azure Blob Storage.
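That documented path boils down to setting the account key in the Spark configuration and reading with a `wasbs://` URL. A notebook-only sketch, where `key1` is the account key retrieved earlier and the other names are placeholders:

```python
# Notebook-only sketch: spark is a global inside a Databricks notebook.
spark.conf.set(
    "fs.azure.account.key.<account>.blob.core.windows.net",
    key1,  # the account key retrieved above
)

df = spark.read.json(
    "wasbs://<container>@<account>.blob.core.windows.net/<your blob name>"
)
```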
Otherwise, you can refer to steps 1 and 2 of my answer to another SO thread about transferring data in Azure Data Factory, and read the data with Python in Databricks as in the code below.
```python
# Note: this uses the legacy azure-storage-blob SDK (versions before 12)
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import ContainerPermissions
from datetime import datetime, timedelta
import pandas as pd

account_name = '<your account name>'
account_key = '<your account key>'  # the key comes from the code above
container_name = '<your container name>'

service = BaseBlobService(account_name=account_name, account_key=account_key)
# Generate a read-only SAS token for the container, valid for one hour
token = service.generate_container_shared_access_signature(
    container_name,
    permission=ContainerPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1)
)

blob_name = '<your blob name of dataset>'
blob_url_with_token = (
    f"https://{account_name}.blob.core.windows.net/"
    f"{container_name}/{blob_name}?{token}"
)

pdf = pd.read_json(blob_url_with_token)
df = spark.createDataFrame(pdf)
```
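The URL assembly at the end of that block can be checked stand-alone with dummy values (every value below is made up; no storage account is contacted):

```python
# Hypothetical placeholder values; no real storage account is involved.
account_name = "mystorageacct"
container_name = "mycontainer"
blob_name = "dataset.json"
token = "sv=2020-08-04&sig=abc123"  # a SAS token is a URL query string

blob_url_with_token = (
    f"https://{account_name}.blob.core.windows.net/"
    f"{container_name}/{blob_name}?{token}"
)
print(blob_url_with_token)
```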