
Get metadata from Blob storage with "folder like structure" using Azure Data Factory pipeline

Original 2022-06-20 13:49:49 6 1 azure/ azure-pipelines/ azure-data-factory

I will get straight to the point. This is the problem:

I have an Azure storage account with Blob storage in which I have multiple containers. In these containers, I have a "folder-like structure" made up of directories and subdirectories (I guess this is the proper terminology, because the dataset has a "Directory" field right after the container, as you can see in the picture).

The structure is as follows (for simplicity I will make it shorter but still representative):

I need to get metadata from the CSV files (in particular the file name) so I can add additional logic to the pipeline so that it knows which files to copy. What is the best solution to get these file names?

I have tried to use a ForEach activity. First, I created a dataset in which I specified only the container name, and I used it in a Get Metadata activity, where I got the output as a list of years (I listed child items). Then I created another dataset, this time parameterized, where I defined the directory as @dataset().FileName (I did not define the file name). I used this dataset in the ForEach loop with a Get Metadata activity, where I was able to get the list of month numbers, as you can see in the file structure above.

Then I went on to create a third dataset (I thought this was already dumb, but I gave it a shot) in which I wanted to include two parameters in the directory field, which would be concatenated. Here I found out that I could not use the parameter of the previous dataset in another dataset. So I thought maybe I could use a variable, but I was not able to do that either, because I got an error every time I tried to use a variable in "Add dynamic content". So then I tried to use a dataset where I defined only the container and file name, but I ended up getting results only for the default value set for the file name at the dataset level.

Since I am quite new to ADF and to creating pipelines, I wonder what I am missing. What would be your proposed solution to get the file names of the CSV files so I can use them later in the pipeline?

1 solution

I reproduced this by iterating through multiple subfolders inside a ForEach activity using the Execute Pipeline activity.

Source dataset:

Create a dataset for the source and add a dataset parameter so the value can be passed dynamically.

Main pipeline:

Using the Get Metadata activity, get the folders inside the given container.

Pass the child items to the ForEach activity. Inside ForEach, add an Execute Pipeline activity to call another pipeline that gets the subfolders for each current item ( @item().name ).

Child pipeline1 (to get the subfolders):

In the child pipeline, create a pipeline parameter to receive the current item name (main folder name) from the parent pipeline.

Using the Get Metadata activity, get the list of subfolders. Use the parameters in the dataset.

Dataset property value: @concat(pipeline().parameters.dir1,'/')

Pass the child items to a ForEach activity. Inside ForEach, you can use a Filter activity to filter out subfolder names if required. Then pass the required current item to an Execute Pipeline activity to call the child pipeline (which gets the files from each subfolder). Pass the child pipeline parameter value from here:

@concat(pipeline().parameters.dir1,'/',item().name,'/')
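To make the two path expressions above concrete, here is a minimal Python sketch of what they evaluate to. The `concat` helper only imitates the ADF expression function of the same name, and the values `2022` and `06` are hypothetical stand-ins for a year folder and a month subfolder; they are not taken from the original post.

```python
def concat(*parts: str) -> str:
    """Imitate the ADF expression function concat(): join strings in order."""
    return "".join(parts)


# Hypothetical values: dir1 is the main folder name passed from the parent
# pipeline, item_name is the current subfolder inside the ForEach loop.
dir1 = "2022"
item_name = "06"

# Counterpart of @concat(pipeline().parameters.dir1,'/')
subfolder_listing_path = concat(dir1, "/")

# Counterpart of @concat(pipeline().parameters.dir1,'/',item().name,'/')
child_pipeline_path = concat(dir1, "/", item_name, "/")

print(subfolder_listing_path)  # 2022/
print(child_pipeline_path)     # 2022/06/
```

Note that the value handed to the next pipeline already ends with a slash, which matters for any expression that appends further path segments to it.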

Child pipeline2 (gets the files and processes them):

Create a pipeline parameter to receive the value from the parent pipeline.

Using the Get Metadata activity, get the files from each subfolder by passing the parameter value to the dataset parameter.

Pass the output child items to a ForEach activity. Inside ForEach, you can use a Filter activity to filter out files. Use the Copy data activity to copy the required files to the sink.
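The filtering step can be pictured with a short Python sketch. It assumes the child items arrive as a list of objects with `name` and `type` fields, where `type` is `File` or `Folder`, which is the shape Get Metadata returns for child items. The sample names and the `filter_csv_files` helper are hypothetical. In ADF itself, a Filter condition along the lines of `@endswith(item().name, '.csv')` would play a similar role.

```python
from typing import Dict, List


def filter_csv_files(child_items: List[Dict[str, str]]) -> List[str]:
    """Return the names of child items that are files ending in .csv."""
    return [
        item["name"]
        for item in child_items
        if item["type"] == "File" and item["name"].lower().endswith(".csv")
    ]


# Hypothetical child items for one subfolder.
sample_child_items = [
    {"name": "sales.csv", "type": "File"},
    {"name": "notes.txt", "type": "File"},
    {"name": "archive", "type": "Folder"},
    {"name": "REPORT.CSV", "type": "File"},
]

csv_names = filter_csv_files(sample_child_items)
print(csv_names)  # ['sales.csv', 'REPORT.CSV']
```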

Dataset properties:

Dir - @concat(pipeline().parameters.path,'/',item().name)
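As a rough illustration of how the three pipelines fit together, the following Python sketch simulates the same nested traversal over an in-memory list of blob names. Everything in it is an assumption made for illustration: the blob names, the year/month layout, and the function names. `get_child_items` only imitates what Get Metadata returns for child items; it does not call Azure. The sketch joins the path and the file name directly, because the path passed down already ends in a slash.

```python
from typing import Dict, List

# Hypothetical blob names inside one container (year/month/file layout).
BLOBS = [
    "2021/12/orders.csv",
    "2022/01/orders.csv",
    "2022/01/readme.txt",
    "2022/02/orders.csv",
]


def get_child_items(folder: str) -> List[Dict[str, str]]:
    """Imitate Get Metadata child items: list the direct children of a folder.

    `folder` is "" for the container root, otherwise a path ending in "/".
    """
    seen: Dict[str, str] = {}
    for blob in BLOBS:
        if not blob.startswith(folder):
            continue
        rest = blob[len(folder):]
        name, sep, _ = rest.partition("/")
        # A remaining "/" means the child is a folder, otherwise a file.
        seen.setdefault(name, "Folder" if sep else "File")
    return [{"name": n, "type": t} for n, t in sorted(seen.items())]


def child_pipeline2(path: str) -> List[str]:
    """Innermost level: list the files in one subfolder and keep the CSVs."""
    return [
        path + item["name"]
        for item in get_child_items(path)
        if item["type"] == "File" and item["name"].endswith(".csv")
    ]


def child_pipeline1(dir1: str) -> List[str]:
    """Middle level: list subfolders of dir1, call child_pipeline2 for each."""
    files: List[str] = []
    for item in get_child_items(dir1 + "/"):
        if item["type"] == "Folder":
            files.extend(child_pipeline2(dir1 + "/" + item["name"] + "/"))
    return files


def main_pipeline() -> List[str]:
    """Outer level: list top-level folders, call child_pipeline1 for each."""
    files: List[str] = []
    for item in get_child_items(""):
        if item["type"] == "Folder":
            files.extend(child_pipeline1(item["name"]))
    return files


csv_files = main_pipeline()
print(csv_files)
```

Each Python function stands in for one pipeline, and each function call stands in for an Execute Pipeline activity inside a ForEach loop. This is why the approach needs one pipeline per folder level.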