
Loading data from ADLS Gen 2 into Azure Synapse

I am trying to load Parquet files from ADLS Gen2 into Synapse using the PolyBase external table feature.

Below is the code. When I run the CREATE EXTERNAL TABLE command, the query never completes. On cancelling the query execution, I see this error:

External file access failed due to internal error: 'Error occurred while accessing HDFS: Java exception raised on call to HdfsBridge_IsDirExist. Java exception message: HdfsBridge::isDirExist - Unexpected error encountered checking whether directory exists or not: UnknownHostException: ''.azuredatalakestore.dfs.core.windows.net'

SQL query:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';
GO

DROP CREDENTIAL ADLSCredential
CREATE DATABASE SCOPED CREDENTIAL ADLSCredential
WITH
    IDENTITY = 'user',
    SECRET = '<secret-key>'
;

CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://<container>@<storage-account>.azuredatalakestore.dfs.core.windows.net',
    CREDENTIAL = ADLSCredential
);

-- Create an external file format for PARQUET files.  
CREATE EXTERNAL FILE FORMAT parquet  
WITH (  
    FORMAT_TYPE = PARQUET,  
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'  
); 

CREATE EXTERNAL FILE FORMAT uncompressedcsv
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '',
        DATE_FORMAT = '',
        USE_TYPE_DEFAULT = False
    )
);

CREATE EXTERNAL TABLE [dbo].[CashReceipts_external] (
    [AMOUNT_APPLIED] [float] NOT NULL,
    [TRX_NUMBER] [nvarchar](50) NULL,
    [SHORT_NAME] [nvarchar](50) NOT NULL,
    [NAME] [nvarchar](1) NULL,
    [CURRENT_RECORD_FLAG] [nvarchar](50) NULL,
    [CURRENCY_CODE] [nvarchar](50) NULL,
    [FUNC_CURRENCY_CODE] [nvarchar](50) NOT NULL,
    [CASH_RCPT_AMOUNT] [float] NULL,
    [CASH_HISTORY_AMOUNT] [float] NULL,
    [FUNC_AMT_HISTORY] [float] NULL,
    [STATUS] [nvarchar](50) NULL,
    [ANTICIPATED_CLEARING_DATE] [nvarchar](50) NULL,
    [CASH_HISTORY_EXCHANGE_RATE] [nvarchar](50) NULL,
    [GL_DATE] [datetime2](7) NULL,
    [GL_PERIOD] [datetime2](7) NOT NULL,
    [BATCH_GL_DATE] [nvarchar](1) NULL,
    [EXCHANGE_RATE] [nvarchar](50) NULL,
    [RECEIPT_NUMBER] [nvarchar](50) NULL,
    [DEPOSIT_DATE] [datetime2](7) NULL,
    [RECEIPT_DATE] [datetime2](7) NULL,
    [ISSUE_DATE] [nvarchar](1) NULL,
    [TYPE] [nvarchar](50) NULL,
    [GL_POSTED_DATE] [datetime2](7) NULL,
    [AMOUNT] [float] NULL
)
WITH
(
    LOCATION='parquetfiles'
,   DATA_SOURCE = AzureDataLakeStorage
,   FILE_FORMAT = parquet
,   REJECT_TYPE = VALUE
,   REJECT_VALUE = 0
)
;

According to the error message, the error is caused by the location "parquetfiles".

Please try the following CREATE EXTERNAL DATA SOURCE command:

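CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage
WITH
  ( LOCATION = 'wasbs://<container>@<storage_account>.blob.core.windows.net' ,
    CREDENTIAL = AzureStorageCredential ,
    -- PolyBase external tables in Synapse use TYPE = HADOOP;
    -- BLOB_STORAGE is meant for bulk operations (BULK INSERT / OPENROWSET).
    TYPE = HADOOP
  ) ;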

In CREATE EXTERNAL TABLE, set LOCATION to the file or folder name:

CREATE EXTERNAL TABLE [dbo].[CashReceipts_external] (
    [AMOUNT_APPLIED] [float] NOT NULL,
    [TRX_NUMBER] [nvarchar](50) NULL,
    [SHORT_NAME] [nvarchar](50) NOT NULL,
    [NAME] [nvarchar](1) NULL,
    [CURRENT_RECORD_FLAG] [nvarchar](50) NULL,
    [CURRENCY_CODE] [nvarchar](50) NULL,
    [FUNC_CURRENCY_CODE] [nvarchar](50) NOT NULL,
    [CASH_RCPT_AMOUNT] [float] NULL,
    [CASH_HISTORY_AMOUNT] [float] NULL,
    [FUNC_AMT_HISTORY] [float] NULL,
    [STATUS] [nvarchar](50) NULL,
    [ANTICIPATED_CLEARING_DATE] [nvarchar](50) NULL,
    [CASH_HISTORY_EXCHANGE_RATE] [nvarchar](50) NULL,
    [GL_DATE] [datetime2](7) NULL,
    [GL_PERIOD] [datetime2](7) NOT NULL,
    [BATCH_GL_DATE] [nvarchar](1) NULL,
    [EXCHANGE_RATE] [nvarchar](50) NULL,
    [RECEIPT_NUMBER] [nvarchar](50) NULL,
    [DEPOSIT_DATE] [datetime2](7) NULL,
    [RECEIPT_DATE] [datetime2](7) NULL,
    [ISSUE_DATE] [nvarchar](1) NULL,
    [TYPE] [nvarchar](50) NULL,
    [GL_POSTED_DATE] [datetime2](7) NULL,
    [AMOUNT] [float] NULL
)
WITH
(
    LOCATION='[filename]'
,   DATA_SOURCE = AzureDataLakeStorage
,   FILE_FORMAT = parquet
,   REJECT_TYPE = VALUE
,   REJECT_VALUE = 0
)
;
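
Once the external table can be created, the load itself is usually a CREATE TABLE AS SELECT (CTAS) from the external table into an internal table. The following is only a minimal sketch for a dedicated SQL pool: the target table name [dbo].[CashReceipts] and the ROUND_ROBIN distribution are assumptions for illustration, not something stated in the question.

```sql
-- Sketch: load the Parquet data exposed by the external table into an
-- internal table. [dbo].[CashReceipts] is a hypothetical target name and
-- ROUND_ROBIN is an assumed distribution; choose what fits your workload.
CREATE TABLE [dbo].[CashReceipts]
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM [dbo].[CashReceipts_external];

-- Quick sanity check that rows arrived.
SELECT COUNT(*) AS loaded_rows
FROM [dbo].[CashReceipts];
```

Because REJECT_VALUE = 0 is set on the external table, a single row that fails type conversion should make the CTAS fail rather than load partial data.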

Ref:

  1. Create external data source to reference Azure blob storage
  2. Create external table: Arguments

LOCATION = 'folder_or_filepath': Specifies the folder or the file path and file name for the actual data in Hadoop or Azure blob storage.

If you specify LOCATION to be a folder, a PolyBase query that selects from the external table will retrieve files from the folder and all of its subfolders. Just like Hadoop, PolyBase doesn't return hidden folders. It also doesn't return files for which the file name begins with an underline (_) or a period (.).

In this example, if LOCATION='/webdata/', a PolyBase query will return rows from mydata.txt and mydata2.txt. It won't return mydata3.txt because it's a file in a hidden folder. And it won't return _hidden.txt because it's a hidden file.

Please note: an external table maps to one LOCATION, but that location can be a folder, so multiple files can be loaded through a single external table.

Hope this helps.

The CREATE EXTERNAL DATA SOURCE command has a LOCATION attribute, which I had set to:

LOCATION = 'abfss://<container>@<storage-account>.azuredatalakestore.dfs.core.windows.net'

It should have been:

LOCATION = 'abfss://<container>@<storage-account>.dfs.core.windows.net'
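
Putting the fix in context, the data source can be dropped and recreated with the Gen2 host name. This is a sketch that reuses the names from the question. It assumes the external table was never created (the statement never completed), so nothing depends on the data source yet.

```sql
-- Sketch: replace the data source that points at the wrong host.
-- Assumes no external table references AzureDataLakeStorage yet.
DROP EXTERNAL DATA SOURCE AzureDataLakeStorage;

CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage
WITH (
    TYPE = HADOOP,
    -- ADLS Gen2 endpoint: <storage-account>.dfs.core.windows.net
    -- (the azuredatalakestore.net host belongs to ADLS Gen1).
    LOCATION = 'abfss://<container>@<storage-account>.dfs.core.windows.net',
    CREDENTIAL = ADLSCredential
);

-- Confirm the stored location before retrying CREATE EXTERNAL TABLE.
SELECT name, location
FROM sys.external_data_sources
WHERE name = 'AzureDataLakeStorage';
```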

I got it mixed up with the ADLS Gen 1 location attribute. My bad. Thanks to all for taking the time to look into this; marking this as closed. I ended up using the AAD app registration token instead of the storage key.
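
For readers who want to authenticate with an AAD app registration, a database scoped credential for a service principal might look like the sketch below. The credential name ADLSServicePrincipalCredential is hypothetical, the placeholders must be filled in, and this is not necessarily the exact statement I ran. It also assumes the app has been granted access to the storage account (for example a Storage Blob Data role).

```sql
-- Sketch: service principal (AAD app registration) credential.
-- ADLSServicePrincipalCredential is a hypothetical name; <client-id>,
-- <tenant-id> and <client-secret> are placeholders for your own values.
CREATE DATABASE SCOPED CREDENTIAL ADLSServicePrincipalCredential
WITH
    IDENTITY = '<client-id>@https://login.microsoftonline.com/<tenant-id>/oauth2/token',
    SECRET = '<client-secret>'
;
```

The external data source then references it with CREDENTIAL = ADLSServicePrincipalCredential in place of the storage-key credential.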

