How does OPENROWSET work internally when loading data

Question

I am going through the Azure documentation and came across the following passage:

OPENROWSET function in Synapse SQL reads the content of the file(s) from a data source. The data source is an Azure storage account and it can be explicitly referenced in the OPENROWSET function or can be dynamically inferred from URL of the files that you want to read.

Where is the data loaded and processed? Is it in memory? Does it load the data in chunks, similar to what Spark does?
Also, it seems OPENROWSET is supported by the serverless SQL pool but not by the dedicated SQL pool. What could be the rationale for that, given that both pools are backed by MS SQL Server, which natively supports OPENROWSET?

Answer 1

OPENROWSET function in Synapse SQL reads the content of the file(s) from a data source. The data source is an Azure storage account and it can be explicitly referenced in the OPENROWSET function or can be dynamically inferred from URL of the files that you want to read.

Where is the data loaded and processed? Is it in memory? Does it load the data in chunks, similar to what Spark does?

The OPENROWSET function for reading files is currently supported only in serverless Synapse SQL, so the answer follows from the serverless architecture. The Distributed Query Processing (DQP) engine runs on the control node and splits each SQL query into smaller queries, called tasks, which it assigns to compute nodes. Each task reads file(s) from the storage account or combines results from other tasks. Compute is scaled out according to what the query needs, unlike dedicated Synapse SQL, where compute resources are provisioned in advance. Serverless Spark pool and serverless SQL pool both follow the same idea: scale compute up when it is needed to run queries, and scale it down once it is no longer needed.

Image reference: Synapse SQL architecture - Azure Synapse Analytics | Microsoft Learn

Two methods are used to read and access files in Azure Storage: OPENROWSET and external tables.

OPENROWSET returns the data in Azure Storage as a rowset. It can connect to a remote data source using Azure AD authentication, or, with the BULK option, read one or more files directly from Azure Storage as a rowset. It is referenced in the FROM clause of a query as if it were a table name.
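
As a minimal sketch of the two ways the storage account can be identified (inferred from the file URL, or referenced explicitly through a data source), the queries below read the same Parquet files both ways. The storage account, container, folder, and the data source name `demo_storage` are placeholders chosen for illustration, and the caller is assumed to already have access to the storage.

```sql
-- Form 1: the storage account is inferred from the full URL of the file(s).
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];

-- Form 2: the storage account is referenced explicitly.
-- demo_storage is a hypothetical name; create it in a user database, not in master.
CREATE EXTERNAL DATA SOURCE demo_storage
WITH ( LOCATION = 'https://<storage-account>.dfs.core.windows.net/<container>' );

-- BULK now holds a path relative to the LOCATION of the data source.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'sales/*.parquet',
    DATA_SOURCE = 'demo_storage',
    FORMAT = 'PARQUET'
) AS [result];
```

The second form keeps the storage URL in one place, which tends to be easier to maintain when many queries read from the same container.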

An external table is used to read data located in Hadoop, Azure Storage Blob, or Azure Data Lake Storage.
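
For comparison, here is a rough sketch of the external table route in a serverless SQL pool. All object names (`demo_lake`, `demo_parquet`, `dbo.sales_ext`), the column list, and the folder layout are assumptions made for the example, not taken from the question.

```sql
-- Hypothetical data source pointing at the container that holds the files.
CREATE EXTERNAL DATA SOURCE demo_lake
WITH ( LOCATION = 'https://<storage-account>.dfs.core.windows.net/<container>' );

-- File format object describing how the files are encoded.
CREATE EXTERNAL FILE FORMAT demo_parquet
WITH ( FORMAT_TYPE = PARQUET );

-- The external table stores only metadata; the data stays in the storage account.
CREATE EXTERNAL TABLE dbo.sales_ext (
    order_id   INT,
    amount     DECIMAL(18, 2),
    order_date DATE
)
WITH (
    LOCATION    = 'sales/*.parquet',
    DATA_SOURCE = demo_lake,
    FILE_FORMAT = demo_parquet
);

-- Once defined, it is queried like any other table.
SELECT TOP 10 * FROM dbo.sales_ext;
```

Unlike OPENROWSET, the schema and file location are defined once and reused by every query that references the table.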

Also, it seems OPENROWSET is supported by the serverless SQL pool but not by the dedicated SQL pool. What could be the rationale for that, given that both pools are backed by MS SQL Server, which natively supports OPENROWSET?

In SQL Server, OPENROWSET and OPENDATASOURCE are used natively for infrequent, ad hoc references to a remote data source: the connection information is specified inline instead of being defined as a linked server. The resulting rowset can then be referenced in a Transact-SQL statement as if it were a table.
For now, the Azure Synapse dedicated SQL pool does not support the OPENROWSET function.
Refer here:
https://learn.microsoft.com/en-us/sql/t-sql/functions/openrowset-transact-sql?view=sql-server-ver16

OPENROWSET() for Synapse dedicated pools? by Stefan Azarić
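
Since OPENROWSET is not available there, files are typically brought into a dedicated SQL pool by loading them into a table instead of querying them in place. The sketch below uses the COPY statement; the target table `dbo.sales_staging` is hypothetical and assumed to exist already with columns matching the files, and the managed identity is assumed to have read access to the storage account.

```sql
-- Loads the Parquet files into an existing table in a dedicated SQL pool.
-- dbo.sales_staging and the storage path are placeholders for illustration.
COPY INTO dbo.sales_staging
FROM 'https://<storage-account>.blob.core.windows.net/<container>/sales/*.parquet'
WITH (
    FILE_TYPE  = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
```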

Syntax:

    OPENROWSET
    ( { BULK 'unstructured_data_path' , [DATA_SOURCE = <data source name>, ]
        FORMAT = ['PARQUET' | 'DELTA'] }
    )
    [WITH ( {'column_name' 'column_type' }) ]
    [AS] table_alias(column_alias, ...n)

OPENROWSET is used in the FROM clause with the BULK option, with the data source set to an Azure storage account. Supported formats are CSV, Parquet, and Delta; JSON files are read through the CSV format.

SELECT *
FROM OPENROWSET(
   BULK '<storagefile-url>',
   FORMAT = '<format-of-file>',
   PARSER_VERSION = '2.0',
   HEADER_ROW = TRUE
) AS rowsFromFile

WITH clause:

SELECT *
FROM OPENROWSET(
   BULK '<storagefile-url>',
   FORMAT = '<format-of-file>',
   PARSER_VERSION = '2.0',
   HEADER_ROW = TRUE
)
WITH
(
   columnname <column-type>
) AS output_table
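
To make the WITH clause more concrete, here is a small sketch against Parquet files, where the names in the WITH clause are matched to the column names stored in the files. The path and the columns `order_id` and `amount` are invented for the example.

```sql
-- Only the columns listed in WITH are exposed, using the given types.
SELECT order_id, amount
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/sales/*.parquet',
    FORMAT = 'PARQUET'
)
WITH (
    order_id INT,
    amount   DECIMAL(18, 2)
) AS [sales];
```

Declaring the columns explicitly avoids relying on schema inference and lets the query skip columns it does not need.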

Because this is based on the serverless architecture, each query is split into small tasks that are run by compute nodes.

Answer 1
0 2023-01-24 07:36:18
