
Azure Databricks with Storage Account as data layer

I have just started working on a data analysis that requires analyzing high-volume data using Azure Databricks. While planning to use a Databricks notebook for the analysis, I have come across different storage options for loading the data: a) DBFS, the default file system from Databricks; b) Azure Data Lake Storage (ADLS); and c) Azure Blob Storage. It looks like options (b) and (c) can be mounted into the workspace to retrieve the data for our analysis.

With the above understanding, could I get the following questions clarified, please?

  1. What's the difference between these storage options when used in the context of Databricks? Do DBFS and ADLS incorporate HDFS file-management principles under the hood, like breaking files into chunks, name nodes, data nodes, etc.?
  2. If I mount an Azure Blob Storage container to analyze the data, would I still get the same performance as with the other storage options? Given that Blob Storage is an object-based store, does it still break files into blocks and load those chunks as RDD partitions into the Spark executor nodes?

DBFS is just an abstraction on top of scalable object storage: S3 on AWS, ADLS on Azure, Google Cloud Storage on GCP.

By default, when you create a workspace, you get an instance of DBFS, the so-called DBFS Root. In addition, you can mount other storage accounts under the /mnt folder. Data written to mount point paths (/mnt) is stored outside of the DBFS root. Even though the DBFS root is writable, it's recommended that you store data in mounted object storage rather than in the DBFS root. The DBFS root is not intended for production customer data, as it has limitations: there is no access control, you can't access the storage account mounted as the DBFS root from outside the workspace, etc.
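For illustration, here is a minimal sketch of mounting an ADLS Gen2 container under /mnt from a Databricks notebook (where dbutils is predefined), assuming a service principal with access to the container; all angle-bracketed values are hypothetical placeholders:

    # Hypothetical service-principal credentials for the OAuth mount.
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-id>",
        "fs.azure.account.oauth2.client.secret":
            dbutils.secrets.get(scope="<secret-scope>", key="<service-credential-key>"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<directory-id>/oauth2/token",
    }

    # After mounting, the container's files are reachable at /mnt/<mount-name>.
    dbutils.fs.mount(
        source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
        mount_point="/mnt/<mount-name>",
        extra_configs=configs,
    )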

The actual implementation of the storage service (name nodes, etc.) is abstracted away: you work with an HDFS-compatible API, but under the hood the implementation differs depending on the cloud and the flavor of storage. For Azure, you can find some details about the implementation in this blog post.
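To make that concrete, here is a sketch showing that the same HDFS-compatible Spark API is used regardless of the backing store; only the path or URI scheme changes. The paths are hypothetical examples, and direct abfss:// access assumes credentials are configured on the cluster:

    # DBFS root
    df_root = spark.read.parquet("dbfs:/tmp/events")

    # Storage account mounted under /mnt
    df_mnt = spark.read.parquet("/mnt/<mount-name>/events")

    # Direct ADLS Gen2 access, bypassing mounts
    df_abfss = spark.read.parquet(
        "abfss://<container>@<storage-account>.dfs.core.windows.net/events"
    )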

Regarding the second question: yes, you still get the splitting of files into chunks, etc. There are differences between Blob Storage and Data Lake Storage, especially for ADLS Gen2, which has a better security model and may be better optimized for big-data workloads. This blog post describes the differences between them.
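As a quick way to see the chunking in action, a sketch (with a hypothetical path) that checks how many input partitions Spark creates when reading from mounted storage; for splittable formats the count is driven by spark.sql.files.maxPartitionBytes (128 MB by default):

    df = spark.read.parquet("/mnt/<mount-name>/large_dataset")
    # Roughly file_size / spark.sql.files.maxPartitionBytes partitions
    # for splittable formats such as Parquet.
    print(df.rdd.getNumPartitions())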
