Azure Databricks with Storage Account as data layer

Original 2021-05-26 15:06:13 5 1 azure/ databricks/ azure-databricks

I have just started working on a data analysis project that requires analyzing high-volume data with Azure Databricks. While planning to use a Databricks notebook for the analysis, I came across different storage options for loading the data: a) DBFS, the default file system from Databricks, b) Azure Data Lake Storage (ADLS), and c) Azure Blob Storage. It looks like items (b) and (c) can be mounted into the workspace to retrieve the data for our analysis.

With the above understanding, may I get the following questions clarified please?

1. What's the difference between these storage options when using them in the context of Databricks? Do DBFS and ADLS incorporate HDFS' file management principles under the hood, like breaking files into chunks, name node, data node, etc.?
2. If I mount an Azure Blob Storage container to analyze the data, would I still get the same performance as with the other storage options? Given that Blob Storage is an object-based store, does it still break the files into blocks and load those chunks as RDD partitions into Spark executor nodes?

1 answer

DBFS is just an abstraction on top of scalable object storage, such as S3 on AWS, ADLS on Azure, or Google Cloud Storage on GCP.

By default, when you create a workspace you get an instance of DBFS, the so-called DBFS root. In addition, you can mount additional storage accounts under the /mnt folder. Data written to mount point paths (/mnt) is stored outside of the DBFS root. Even though the DBFS root is writable, it's recommended that you store data in mounted object storage rather than in the DBFS root. The DBFS root is not intended for production customer data because it has limitations: for example, it lacks access control, and you can't access the storage account that backs the DBFS root from outside the workspace.
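As a rough illustration of mounting a Blob Storage container under /mnt, here is a minimal sketch. It assumes account-key authentication and the `dbutils.fs.mount` / `dbutils.fs.mounts` notebook utilities; the helper functions and the storage account, container, secret scope, and file names are hypothetical.

```python
def blob_mount_args(account, container, mount_name, account_key):
    """Build keyword arguments for dbutils.fs.mount (account-key auth assumed)."""
    host = f"{account}.blob.core.windows.net"
    return {
        "source": f"wasbs://{container}@{host}",
        "mount_point": f"/mnt/{mount_name}",
        "extra_configs": {f"fs.azure.account.key.{host}": account_key},
    }


def mount_if_needed(dbutils, args):
    """Mount only if the mount point is not already present.

    Returns True when a new mount was created, False otherwise.
    """
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if args["mount_point"] in existing:
        return False
    dbutils.fs.mount(**args)
    return True


# In a notebook (hypothetical names; the key is read from a secret scope):
# key = dbutils.secrets.get(scope="my-scope", key="storage-key")
# mount_if_needed(dbutils, blob_mount_args("mystorageacct", "raw", "raw", key))
# df = spark.read.csv("/mnt/raw/events.csv", header=True)
```

Checking the existing mounts first keeps the notebook re-runnable, since mounting an already mounted path raises an error.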

The actual implementation of the storage service (namenodes, etc.) is abstracted away: you work with an HDFS-compatible API, but the implementation under the hood differs depending on the cloud and the flavor of storage. For Azure, you can find some details about their implementation in this blog post.
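To make the "HDFS-compatible API" point concrete, the sketch below builds the path styles that the same Spark reader accepts for each backend. The `abfss://` and `wasbs://` URI layouts are the standard Hadoop Azure schemes; the helper functions and all account, container, and file names are hypothetical.

```python
def storage_uri(kind, account, container, path):
    """Return a direct (unmounted) URI for a file in an Azure storage account."""
    path = path.lstrip("/")
    if kind == "adls_gen2":
        return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"
    if kind == "blob":
        return f"wasbs://{container}@{account}.blob.core.windows.net/{path}"
    raise ValueError(f"unknown storage kind: {kind}")


def mounted_uri(mount_name, path):
    """Return the DBFS path of the same file when the container is mounted."""
    return f"dbfs:/mnt/{mount_name}/{path.lstrip('/')}"


# The reader call is the same whichever backend sits behind the path
# (credentials for direct URIs are assumed to be configured separately):
# spark.read.parquet(storage_uri("adls_gen2", "mystorageacct", "raw", "sales/2021"))
# spark.read.parquet(mounted_uri("raw", "sales/2021"))
```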

Regarding the second question: yes, you should still get the splitting of files into chunks, etc. There are differences between Blob Storage and Data Lake Storage, especially ADLS Gen 2, which has a better security model and may be better optimized for big data workloads. This blog post describes the differences between them.
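The splitting can be pictured as byte ranges computed by the reader rather than blocks stored by the service. The sketch below is a simplified model, not Spark's actual planner: it assumes a splittable file format and a fixed split size equal to the 128 MB default of `spark.sql.files.maxPartitionBytes`. Real planning also weighs file open cost and cluster parallelism, and non-splittable files such as gzip are read as a single partition. The function name is hypothetical.

```python
MIB = 1024 * 1024
DEFAULT_MAX_SPLIT = 128 * MIB  # default of spark.sql.files.maxPartitionBytes


def split_ranges(file_size, max_split_bytes=DEFAULT_MAX_SPLIT):
    """Return (offset, length) byte ranges covering a file of file_size bytes.

    Each range stands for one input partition that an executor would
    fetch from the object store with a ranged read.
    """
    if max_split_bytes <= 0:
        raise ValueError("max_split_bytes must be positive")
    ranges = []
    offset = 0
    while offset < file_size:
        length = min(max_split_bytes, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges


# A 300 MiB file yields three ranges: two of 128 MiB and one of 44 MiB.
for offset, length in split_ranges(300 * MIB):
    print(f"offset={offset // MIB} MiB, length={length // MIB} MiB")
```

On a real cluster, `df.rdd.getNumPartitions()` reports how many partitions a read actually produced.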


The technical posts on this site follow the CC BY-SA 4.0 license. If you need to reprint, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

solution1 1 ACCEPTED 2021-05-27 05:57:41
