
Azure Data Lake for Structured Data

Original post: 2020-02-05 16:44:17 | Tags: azure, azure-data-lake

We've been reviewing Microsoft's Modern Data Warehouse architectures (link here), which reference using Azure Data Factory to pull structured and unstructured data into Azure Data Lake. I've also attended a lot of presentations on the subject, but people are split on whether the Data Lake is a good home for structured data. What I am trying to determine is whether importing data into the Data Lake is a good strategy if the only sources we will be using are on-prem SQL Server databases. And what would be the advantages and disadvantages of that strategy?

For context, we're looking for a single pane of glass for consumption, whether it's end users' reporting with Power BI or fodder for Azure Data Warehouse / an on-prem data warehouse. We want one container that is the source for all of these systems and that is not the source OLTP system (i.e. OLTP database --> (Azure Data Factory) --> Data Lake --> everything else).

I appreciate any guidance on the subject. Thank you.

2 answers

You have not mentioned the data size, and I think that for a move to ADL, data size is a very important factor. In your case the data is highly structured. If you had massive, unstructured data and wanted to process it later with ADB, Hadoop, or some other technology, I think ADL would be a good candidate.

You should also consider that data is encrypted in motion using SSL. You can authorize users and groups with fine-grained POSIX-based ACLs for all data in the Store, enabling role-based access controls.
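To make the ACL point a bit more concrete, here is a minimal sketch of how a POSIX-style ACL check is usually described: owner first, then named users, then groups, then everyone else, with a mask capping the named and group entries. This is a simplified model written for illustration, not the Azure implementation or any SDK call; the `Acl` class, the `is_allowed` function, and all principal names are hypothetical, and superuser handling, directory traversal, and default ACLs are left out.

```python
from dataclasses import dataclass, field

# Permission bits, as in the classic rwx triplet.
READ, WRITE, EXECUTE = 4, 2, 1


@dataclass
class Acl:
    """Hypothetical, simplified ACL for a single file or folder."""
    owner: str
    owning_group: str
    owner_perms: int
    group_perms: int
    other_perms: int
    mask: int = READ | WRITE | EXECUTE  # caps named-user and group entries
    named_users: dict = field(default_factory=dict)
    named_groups: dict = field(default_factory=dict)


def is_allowed(acl, user, groups, wanted):
    """Return True if `user` (a member of `groups`) gets all `wanted` bits."""
    # 1. The owning user is checked against the owner entry; no mask applies.
    if user == acl.owner:
        return (acl.owner_perms & wanted) == wanted
    # 2. A named-user entry wins next, limited by the mask.
    if user in acl.named_users:
        return (acl.named_users[user] & acl.mask & wanted) == wanted
    # 3. Any matching group entry (owning or named) may grant access.
    group_entries = []
    if acl.owning_group in groups:
        group_entries.append(acl.group_perms)
    group_entries += [p for g, p in acl.named_groups.items() if g in groups]
    if group_entries:
        return any((p & acl.mask & wanted) == wanted for p in group_entries)
    # 4. Everyone else falls through to the "other" entry.
    return (acl.other_perms & wanted) == wanted


# Hypothetical ACL: the ETL account owns the data, analysts may only read,
# and the mask limits a contractor's rw- entry to read-only.
sales_acl = Acl(
    owner="etl",
    owning_group="engineers",
    owner_perms=READ | WRITE,
    group_perms=READ,
    other_perms=0,
    mask=READ,
    named_users={"contractor": READ | WRITE},
    named_groups={"analysts": READ},
)

print(is_allowed(sales_acl, "etl", set(), READ | WRITE))    # True
print(is_allowed(sales_acl, "dana", {"analysts"}, READ))    # True
print(is_allowed(sales_acl, "dana", {"analysts"}, WRITE))   # False
print(is_allowed(sales_acl, "contractor", set(), WRITE))    # False
```

The mask is the part that tends to surprise people: a named entry can say rw- and still end up read-only.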

The only real value in taking structured data, flattening it, and loading it into a data lake is to save cost and to decouple the data from any proprietary tool/compute. In your scenario, it will be less expensive to store the data in a data lake store than in Azure SQL Database.

However, there is a complexity cost to flattening the data. You will need to restructure the data (i.e. load it back into a database, or wrap a logical structure around it) when you need to consume it. Formats such as Parquet will help with this, but it is more complex for users to query data in a data lake than it is to connect to a relational database. Almost all analysts and data consumers will know how to query a relational database, especially if the data is already in SQL Server.
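A rough illustration of that restructuring cost, using only the Python standard library: `sqlite3` stands in for the source SQL Server database and a plain dictionary stands in for the lake, so every table, file path, and helper name here is an assumption made up for the sketch. Once the tables are flattened to CSV, the types and the relationship between them are gone, and the consumer has to put both back by hand.

```python
import csv
import io
import sqlite3

# Stand-in for the source OLTP database (sqlite3 here, not SQL Server).
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Contoso'), (2, 'Fabrikam');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5);
""")


def flatten_to_csv(conn, table):
    """Dump one table to CSV text, the way a flat file in the lake would hold it."""
    cursor = conn.execute(f"SELECT * FROM {table}")
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)
    return buffer.getvalue()


# Hypothetical "lake": file path -> file contents.
lake = {f"raw/{t}.csv": flatten_to_csv(source, t) for t in ("customers", "orders")}


def load_typed(text, types):
    """Re-apply the column types that the flat file no longer carries."""
    return [
        {col: types[col](value) for col, value in row.items()}
        for row in csv.DictReader(io.StringIO(text))
    ]


customers = load_typed(lake["raw/customers.csv"], {"id": int, "name": str})
orders = load_typed(
    lake["raw/orders.csv"], {"id": int, "customer_id": int, "amount": float}
)

# The join that was a single SQL statement at the source is now consumer code.
names = {c["id"]: c["name"] for c in customers}
totals = {}
for order in orders:
    name = names[order["customer_id"]]
    totals[name] = totals.get(name, 0.0) + order["amount"]

print(totals)  # {'Contoso': 65.0, 'Fabrikam': 15.5}
```

A columnar format such as Parquet stores the schema with the data, which would remove the `load_typed` step, but the join logic still moves to whoever consumes the files.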

Look at the volume of data and the use cases for consumption to make that decision. A "logical datalake" can include structured data in a relational database, semi-structured data flattened in a storage account, and unstructured data saved to a storage account.