
Azure Data Lake - HDInsight vs Data Warehouse

We currently read from our Azure Data Lake using external tables in Azure Data Warehouse.

This lets us query the data lake with familiar SQL.

However, another option is to use Data Lake Analytics, or some variation of HDInsight.

Performance-wise, I'm not seeing much difference. I assume Data Warehouse runs some form of distributed query in the background, perhaps converting to U-SQL(?). If so, why would we use Data Lake Analytics, with its slightly different U-SQL syntax?

With Python scripting also available in SQL, I feel I'm missing a key purpose of Data Lake Analytics, other than cost (pay per batch job, rather than paying for the constant uptime of a database).

If your main purpose is to query data stored in Azure Data Warehouse (ADW), then there is no real benefit to using Azure Data Lake Analytics (ADLA). But as soon as you have other (un)structured data stored in ADLS, such as JSON documents or CSV files, the benefit of ADLA becomes clear: U-SQL allows you to join your relational data stored in ADW with the (un)structured / NoSQL data stored in ADLS.

It also lets you use U-SQL to prepare this other data for direct import into ADW, so Azure Data Factory is no longer required to get the data into your data warehouse. See this blogpost for more information:

A common use case for ADLS and SQL DW is the following. Raw data is ingested into ADLS from a variety of sources. Then ADL Analytics is used to clean and process the data into a load-ready format. From there, the high-value data can be imported into Azure SQL DW via PolyBase.

..

You can import data stored in ORC, RC, Parquet, or Delimited Text file formats directly into SQL DW using the Create Table As Select (CTAS) statement over an external table.
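To make the CTAS-over-external-table step concrete, here is a minimal sketch in Python that only assembles the two T-SQL statements involved. All object names (ext.Sales, dbo.Sales, AdlsSource, ParquetFormat, the /curated/sales/ path, the columns) are hypothetical, and it assumes an external data source and external file format have already been created in SQL DW. The ROUND_ROBIN distribution and clustered columnstore index are just one plausible choice, not a recommendation from the quoted post.

```python
def external_table_sql(ext_table, columns, location, data_source, file_format):
    """Render CREATE EXTERNAL TABLE for files that already sit in the lake.

    data_source and file_format are assumed to exist already
    (created with CREATE EXTERNAL DATA SOURCE / CREATE EXTERNAL FILE FORMAT).
    """
    cols = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {ext_table} (\n{cols}\n)\n"
        "WITH (\n"
        f"    LOCATION = '{location}',\n"
        f"    DATA_SOURCE = {data_source},\n"
        f"    FILE_FORMAT = {file_format}\n"
        ");"
    )


def ctas_sql(target_table, ext_table):
    """Render a CTAS that copies the external data into an internal table."""
    return (
        f"CREATE TABLE {target_table}\n"
        "WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)\n"
        f"AS SELECT * FROM {ext_table};"
    )


if __name__ == "__main__":
    # Hypothetical names throughout; replace with your own objects.
    print(external_table_sql(
        "ext.Sales",
        [("SaleId", "INT"), ("Amount", "DECIMAL(18,2)")],
        "/curated/sales/",
        "AdlsSource",
        "ParquetFormat",
    ))
    print()
    print(ctas_sql("dbo.Sales", "ext.Sales"))
```

Running the script prints the two statements in the order you would execute them: first the external table that points at the load-ready files, then the CTAS that materializes them inside SQL DW.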

Please note that SQL statements in SQL Data Warehouse currently do NOT generate U-SQL behind the scenes. Also, the use cases for ADLA/U-SQL and SQL DW are different.

ADLA gives you a processing engine for batch data preparation/cooking, which produces the data for a data mart/warehouse that you can then read interactively with SQL DW. In your example above, you seem to be mainly doing the second part. Adding "Views" on top of these EXTERNAL tables to do transformations in SQL DW will quickly run into scalability limits if you are operating on big data (and not just a few 100k rows).

