
Create Folder Based on File Name in Azure Data Factory

I have a requirement to copy a few files from one ADLS Gen1 location to another ADLS Gen1 location, creating folders based on the file names.

I have the following files in the source ADLS:

ABCD_20200914_AB01_Part01.csv.gz
ABCD_20200914_AB02_Part01.csv.gz
ABCD_20200914_AB03_Part01.csv.gz
ABCD_20200914_AB03_Part01.json.gz
ABCD_20200914_AB04_Part01.json.gz
ABCD_20200914_AB04_Part01.csv.gz

Scenario 1: I have to copy only the csv files into the destination ADLS as shown below, creating the folder from the file name (if the folder already exists, copy into that folder):

AB01-
    |-ABCD_20200914_AB01_Part01.csv.gz
AB02-
    |-ABCD_20200914_AB02_Part01.csv.gz
AB03-
    |-ABCD_20200914_AB03_Part01.csv.gz
AB04-
    |-ABCD_20200914_AB04_Part01.csv.gz

Scenario 2: I have to copy both the csv and json files into the destination ADLS as shown below, creating the folder from the file name (if the folder already exists, copy into that folder):

AB01-
    |-ABCD_20200914_AB01_Part01.csv.gz
AB02-
    |-ABCD_20200914_AB02_Part01.csv.gz
AB03-
    |-ABCD_20200914_AB03_Part01.csv.gz
    |-ABCD_20200914_AB03_Part01.json.gz
AB04-
    |-ABCD_20200914_AB04_Part01.csv.gz
    |-ABCD_20200914_AB04_Part01.json.gz

Is there any way to achieve this in Data Factory? I would appreciate any leads!

I am not sure if this will entirely help, but I had a similar situation where we had one zip file and I had to copy the files inside it out into their own folders.

What you can do is use parameters in the sink dataset, plus a Set Variable activity where you take a substring of the file name.
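As a rough illustration of the substring step applied to the file names in the question, the Python sketch below mimics what a Data Factory expression along the lines of split(item().name, '_')[2] would return. The function name folder_for is hypothetical, and the sketch assumes the folder code is always the third underscore-separated token of the file name.

```python
def folder_for(file_name: str) -> str:
    """Return the destination folder for a file such as
    'ABCD_20200914_AB01_Part01.csv.gz'.

    Assumption: names follow the layout prefix_date_folder_part.ext.
    """
    parts = file_name.split("_")
    if len(parts) < 4:
        raise ValueError(f"unexpected file name layout: {file_name!r}")
    return parts[2]


# The third token of the name is the folder code.
print(folder_for("ABCD_20200914_AB01_Part01.csv.gz"))
```

If the names can deviate from that layout, the check on the number of tokens is where you would add stricter validation.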

The job below is more of a delta job, but I think it has enough in it to help. My job can be divided into 3 sections.


The first orange section gets the latest file name date from the ADLS Gen1 folder that you want to copy.

It is then moved to the orange block. At the bottom I get the latest file name based on the ADLS Gen1 date, and then I take a substring to pull out the date portion of the file name. In your case you might be able to build an array and capture all of the folder names that you need.
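To make the array idea concrete for the two scenarios in the question, here is a small Python sketch that keeps only the wanted extensions and groups the remaining files by folder name. It is only a model of the logic, not Data Factory code; in a pipeline the extension check would more likely be a Filter activity using endswith(). The names SOURCE_FILES and plan_folders are hypothetical.

```python
# File names taken from the question.
SOURCE_FILES = [
    "ABCD_20200914_AB01_Part01.csv.gz",
    "ABCD_20200914_AB02_Part01.csv.gz",
    "ABCD_20200914_AB03_Part01.csv.gz",
    "ABCD_20200914_AB03_Part01.json.gz",
    "ABCD_20200914_AB04_Part01.json.gz",
    "ABCD_20200914_AB04_Part01.csv.gz",
]


def plan_folders(file_names, extensions):
    """Group file names by folder code, keeping only the wanted extensions.

    Assumption: the folder code is the third underscore-separated token.
    """
    plan = {}
    for name in file_names:
        if not name.endswith(tuple(extensions)):
            continue  # skip files whose extension is not wanted
        folder = name.split("_")[2]
        plan.setdefault(folder, []).append(name)
    return plan


scenario_1 = plan_folders(SOURCE_FILES, [".csv.gz"])
scenario_2 = plan_folders(SOURCE_FILES, [".csv.gz", ".json.gz"])

for folder, files in scenario_2.items():
    print(folder, files)
```

Because the grouping is keyed on the folder name, a folder that already "exists" in the plan simply receives the additional file, which matches the "if the folder exists, copy to that folder" requirement.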

Getting file name:

Getting substring:

In the top section I first extract and unzip that file into a test landing zone.

Source:

Sink:

I then get the names of all the files that were in that zip file, to then be used in the ForEach activity. These file names will then become folders for the copy activity.

Get file names from the initial landing zone:

I then pass those childItems from "Get list of staged files" into the ForEach:
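For readers unfamiliar with what the ForEach receives, the sketch below models the childItems list in Python: each entry has a name and a type, and the loop body runs once per entry. The sample entries are hypothetical (the "archive" folder is made up to show the type check), and the parameter name sourceFileName is an assumption, not something from my pipeline.

```python
# Hypothetical sample of a Get Metadata "childItems" output:
# each entry carries a name and a type ("File" or "Folder").
child_items = [
    {"name": "ABCD_20200914_AB01_Part01.csv.gz", "type": "File"},
    {"name": "ABCD_20200914_AB03_Part01.json.gz", "type": "File"},
    {"name": "archive", "type": "Folder"},  # made-up entry to show the type check
]


def iterations(items):
    """Yield one dict of copy parameters per file, skipping sub-folders."""
    for item in items:
        if item["type"] != "File":
            continue
        yield {"sourceFileName": item["name"]}


for params in iterations(child_items):
    print(params)
```

Inside the real ForEach, the current entry is available as item(), so item().name plays the role of item["name"] here.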


In that ForEach activity I have one copy activity. For that I made two datasets. One grabs the files from the initial landing zone that we created. For this example let's call it Staging (forgive the MS Paint drawing):


The purpose of this is to go to that dummy folder and grab each file that was just copied there. From that one zip file we expect 5 files.

In the Sink section I created a new dataset with parameters for the folder and the file name. In that dataset I am putting the data into the same container, but I created a new folder called "Stage" and concatenated it with the item name. I also added a "replace" command to remove the ".txt" from the file name.
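The folder parameter described above boils down to a concat plus a replace. A minimal Python sketch of that string logic follows; the "/" separator, the function name sink_folder, and the sample file name customers.txt are assumptions for illustration.

```python
def sink_folder(item_name: str, root: str = "Stage") -> str:
    """Build the sink folder path: the root folder plus the file name
    with its '.txt' extension removed.

    Assumption: folder levels are joined with '/'.
    """
    return f"{root}/{item_name.replace('.txt', '')}"


# 'customers.txt' is a made-up staged file name.
print(sink_folder("customers.txt"))
```

For the file names in the question you would swap the replace for the substring that extracts the folder code (AB01, AB02, and so on).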


What this does is give each file coming from that dummy staging folder its own folder, named after the file. Based on your requirements I am not sure if that is what you want to do, but you can always rework it to be more specific.

For the item name I basically take the same file name, replace the ".txt", concat the date value, and only after that add the ".txt" extension back. Otherwise the ".txt" would have ended up in the middle of the file name.
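The ordering matters here, so a short Python sketch of the same steps may help: strip the extension, append the date value, then put the extension back. The underscore between the name and the date, the function name sink_file_name, and the sample values are assumptions.

```python
def sink_file_name(item_name: str, date_value: str) -> str:
    """Insert the date before the '.txt' extension instead of after it."""
    stem = item_name.replace(".txt", "")   # 1. drop the extension
    dated = f"{stem}_{date_value}"         # 2. append the date (separator assumed)
    return f"{dated}.txt"                  # 3. add the extension back


# Made-up file name and date value.
print(sink_file_name("customers.txt", "20200914"))
```

Appending the date directly to the original name would instead yield something like customers.txt_20200914, which is the problem the replace avoids.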

At the end I created a Delete activity that deletes all the files from staging (I am not sure if I have set that up properly, so feel free to adjust it).


Hopefully the description above gives you an idea of how to use parameters for your files. Let me know if this helps in your situation.

