
How to mount Azure Data Lake Store on DBFS

I need to mount Azure Data Lake Store Gen1 data folders on the Azure Databricks File System (DBFS) using Azure service principal client credentials. Please help with this.

There are three ways of accessing Azure Data Lake Storage Gen1:

  1. Pass your Azure Active Directory credentials, also known as credential passthrough.
  2. Mount an Azure Data Lake Storage Gen1 filesystem to DBFS using a service principal and OAuth 2.0.
  3. Use a service principal directly.

1. Pass your Azure Active Directory credentials, also known as credential passthrough:

You can authenticate automatically to Azure Data Lake Storage Gen1 from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that you use to log in to Azure Databricks. When you enable your cluster for Azure AD credential passthrough, commands that you run on that cluster can read and write your data in Azure Data Lake Storage Gen1 without requiring you to configure service principal credentials for access to storage.

Enable Azure Data Lake Storage credential passthrough for a standard cluster


For complete setup and usage instructions, see Secure access to Azure Data Lake Storage using Azure Active Directory credential passthrough.
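For example, once passthrough is enabled on the cluster, a notebook can read Gen1 paths with no credentials in the code at all. A minimal sketch, where <storage-resource> and <directory-name> are placeholders for your store and path:

# On a passthrough-enabled cluster, access is authorized as the logged-in
# Azure Databricks user; no service principal credentials appear in the notebook.
df = spark.read.text("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")
df.show()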

2. Mount an Azure Data Lake Storage Gen1 filesystem to DBFS using a service principal and OAuth 2.0:

Step 1: Create and grant permissions to a service principal

If your selected access method requires a service principal with adequate permissions, and you do not have one, follow these steps:

  1. Create an Azure AD application and service principal that can access resources. Note the following properties:

    application-id: An ID that uniquely identifies the client application.

    directory-id: An ID that uniquely identifies the Azure AD instance.

    service-credential: A string that the application uses to prove its identity.

  2. Register the service principal, granting the correct role assignment, such as Contributor, on the Azure Data Lake Storage Gen1 account (see the secret-scope note below).
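
Note: the mount snippet in step 2 reads the service credential from a Databricks secret scope, so the client secret from step 1 is assumed to have been stored there beforehand (for example via the Databricks CLI). A minimal sketch to verify the secret is in place, using the hypothetical scope and key names from the snippet:

dbutils.secrets.listScopes()          # list the available secret scopes
dbutils.secrets.list("<scope-name>")  # list the keys stored in the scope
# Secret values themselves are redacted when displayed in a notebook.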

Step 2: Mount the Azure Data Lake Storage Gen1 resource using a service principal and OAuth 2.0

Python code (per the Databricks docs, replace <prefix> with fs.adl on Databricks Runtime 6.0 and above, or dfs.adl on earlier runtimes):

configs = {"<prefix>.oauth2.access.token.provider.type": "ClientCredential",
           "<prefix>.oauth2.client.id": "<application-id>",
           "<prefix>.oauth2.credential": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
           "<prefix>.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)
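
Once mounted, the store behaves like any other DBFS path. A short usage sketch, assuming the hypothetical mount name from above:

# Read and list through the mount point like a local DBFS path.
df = spark.read.text("/mnt/<mount-name>/<directory-name>")
display(dbutils.fs.ls("/mnt/<mount-name>"))

# Unmount when the mount is no longer needed.
dbutils.fs.unmount("/mnt/<mount-name>")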


3. Access directly with Spark APIs using a service principal and OAuth 2.0:

You can access an Azure Data Lake Storage Gen1 storage account directly (as opposed to mounting with DBFS) with OAuth 2.0 using the service principal.

Access using the DataFrame API:

To read from your Azure Data Lake Storage Gen1 account, you can configure Spark to use service credentials with the following snippet in your notebook:

spark.conf.set("<prefix>.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("<prefix>.oauth2.client.id", "<application-id>")
spark.conf.set("<prefix>.oauth2.credential","<key-name-for-service-credential>"))
spark.conf.set("<prefix>.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

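After these configs are set on the Spark session, reads go directly against the adl:// URI. A minimal sketch with the same placeholder names:

# With the OAuth configs set, Spark reads directly from the store; no mount is involved.
# <storage-resource> and <directory-name> are placeholders for your store and path.
df = spark.read.parquet("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")
df.show()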

Reference: Azure Databricks - Azure Data Lake Storage Gen1
