Add Azure Blob Partitions to Azure SQL Table
I have partitioned Parquet files in Azure Blob that I am copying to Azure SQL. How do I get the partition name into the SQL table?
I've figured out how to get the full file path into the SQL table by adding an Additional Column in the source data section of the Copy Activity (images 1 and 2), but I can't work out how to reduce the full file path to just the partition name (202105), for example with a regex.
In the data preview for the source data in the Copy Activity, the time_period column shows just the partition name (image 3). But when it arrives in SQL it is NULL for all rows (or it is the full file path, depending on whether I added Additional Columns in the source data section of the Copy Activity).
I've tried changing the data type for time_period to INT in Azure SQL. I've tried parsing $$FILEPATH, but nothing I've tried has worked.
I'm basically starting from scratch, as I'm sure there's a better way.
As explained in the Microsoft docs, you can use the enablePartitionDiscovery feature.
Source: partitioned files
Source Dataset:
Specify only the container name and leave the directory and file fields empty. We will filter the files using Wildcard paths in the Copy Activity.
Configure the source in the Copy Activity according to your file paths:
Note: you can skip step 4, i.e. the additional column with $$FILEPATH; it is shown only for reference. You can drop it, because enablePartitionDiscovery already gives you the ready-made column.
To pick up a single folder, set it as below.
Wildcard paths: sink / columnparts / time_period=202105 / *.parquet
For multiple folders (time_period=202105, time_period=202106, ...), as seen in the previous snip, set it as below.
** takes the place of any folder under the parent folder columnparts.
Wildcard paths: sink / columnparts / ** / *.parquet
Partition root path: this should point to the parent folder that contains all the partitioned folders.
In my example: sink/columnparts
The partition root path must be provided when you enable partition discovery.
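To make the role of the partition root path concrete, here is a minimal Python sketch of the general idea behind partition discovery: every key=value folder between the root and the file becomes a column. It is only an illustration of the concept, not how the Copy Activity is implemented; discover_partitions is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def discover_partitions(file_path: str, partition_root: str) -> dict[str, str]:
    """Return {column: value} for each key=value folder between the root and the file.

    Hypothetical helper for illustration only; raises ValueError if file_path
    is not located under partition_root.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(partition_root))
    columns: dict[str, str] = {}
    # parts[:-1] skips the file name; only folders can carry partition values.
    for folder in relative.parts[:-1]:
        key, sep, value = folder.partition("=")
        if sep and key:
            columns[key] = value
    return columns


print(discover_partitions(
    "sink/columnparts/time_period=202105/"
    "part-00004-fcbe0bf5-2c93-45f5-9bb2-2f9089a3e83a-c000.snappy.parquet",
    "sink/columnparts",
))
```

With the root set to sink/columnparts, only the time_period folder lies between the root and the file, which is why it is the one column that gets added.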
Sink: optionally update an existing table, or just create a new one.
View from SQL DB: the time_period column holds the value 202105.
time_period=202105/part-00004-fcbe0bf5-2c93-45f5-9bb2-2f9089a3e83a-c000.snappy.parquet
If you see this error:
Your mapping has not been updated! In the mapping section, you can clear or reset the schema and then Import schema again, just to be sure. 😊
In my case it was the additional column file_path.
--OR--
$$FILEPATH is a reserved variable; you cannot use it in the expression builder or manipulate it with functions.
Instead, you can add a step after the copy to SQL DB, i.e. use a stored procedure as below.
Here the column path holds the full file path received from $$FILEPATH, as you have already managed. StoreParquetTest is the table created in the SQL sink.
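CREATE PROCEDURE trimpath
AS
-- Keep only the text between the first '=' and the next '/',
-- e.g. 'time_period=202105/part-...' becomes '202105'.
UPDATE StoreParquetTest
SET path = SUBSTRING(path, CHARINDEX('=', path) + 1, CHARINDEX('/', path, CHARINDEX('=', path)) - CHARINDEX('=', path) - 1)
-- Skip rows that are already trimmed or contain no key=value/ segment.
WHERE path LIKE '%=%/%'
GO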
Now you can use the Stored Procedure activity in the pipeline after the Copy Activity.
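To sanity-check the string logic before wiring it into the pipeline, the same idea can be sketched in Python. This is only an illustration and not part of the pipeline: trim_path is a hypothetical helper, and it assumes the partition folder has the form key=value followed by a '/'. It looks for the '/' that comes after the '=', so paths with leading folders are handled as well.

```python
def trim_path(path: str) -> str:
    """Return the text between the first '=' and the next '/'.

    Hypothetical helper for illustration; paths without a key=value/ segment
    are returned unchanged.
    """
    eq = path.find("=")
    if eq == -1:
        return path
    slash = path.find("/", eq + 1)
    if slash == -1:
        return path
    return path[eq + 1:slash]


sample = "time_period=202105/part-00004-fcbe0bf5-2c93-45f5-9bb2-2f9089a3e83a-c000.snappy.parquet"
print(trim_path(sample))
```

Running it on a handful of real paths from your container is a quick way to confirm the '=' and '/' positions are what you expect.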