
Dataflow integration with ForEach loop in Azure Data Factory

We have a data lake container with three folders a, b, c. Each folder has 3 files (a1, a2, a3; b1, b2, b3; c1, c2, c3). Now we need to design a pipeline which will dynamically do an incremental load from the folders to a blob storage, keeping the same file name as the source. I have implemented the incremental load in a dataflow. We have other dataflow dependencies as well, so we can't use a copy activity and must use the dataflow. I am unable to integrate the Get Metadata activity with the dataflow, and that is where I am expecting some help.

I tried with parameters and variables, but I did not get the desired output. I used Get Metadata with the Child items argument, then a ForEach loop. Inside the ForEach I tried another ForEach to get the files, and I used an append variable to collect the data. I have already implemented the upsert logic for a single table in the dataflow. If I pass the second Get Metadata activity's output (inside the ForEach) to the dataflow, it does not accept it. The main problem I am facing is integrating the dataflow with the ForEach at the dataset level, because the dataset of the dataflow depends on Get Metadata's output.

A nested ForEach is not possible in Azure Data Factory. The workaround is to use an Execute Pipeline activity inside the ForEach activity. To pass the output of the Get Metadata activity to the dataflow, create dataflow parameters and pass the values to those parameters. I tried to reproduce this scenario in my environment; below is the approach.

Outer Pipeline:

  • A Get Metadata activity is taken and only the container name is given in the dataset file path. + New is selected in the field list and the Child items argument is added. This activity provides the list of all the directories present in the container.

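A rough JSON sketch of this Get Metadata activity is shown below; the dataset name DS_Container is a placeholder for a dataset that points only to the container, and the exact JSON generated by the ADF UI may differ slightly.

```json
{
    "name": "Get Metadata1",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "DS_Container",
            "type": "DatasetReference"
        },
        "fieldList": [ "childItems" ]
    }
}
```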

  • A ForEach activity is taken, and the output of the Get Metadata activity is given in Items: @activity('Get Metadata1').output.childItems

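The ForEach activity would look roughly like this in JSON (activity names are placeholders; the Execute Pipeline activity from the next step goes inside the activities array):

```json
{
    "name": "ForEach1",
    "type": "ForEach",
    "dependsOn": [
        {
            "activity": "Get Metadata1",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('Get Metadata1').output.childItems",
            "type": "Expression"
        },
        "activities": [ ]
    }
}
```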

  • Inside the ForEach activity, an Execute Pipeline activity is added.
  • A new child pipeline is created, and a parameter called FolderName is created in that pipeline.
  • The child pipeline name is given in the Execute Pipeline activity. The value for the parameter is given as @item().name, to pass the directory names as input to the child pipeline.

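A sketch of the Execute Pipeline activity, assuming the child pipeline is named ChildPipeline (the exact JSON may differ slightly from what the UI generates):

```json
{
    "name": "Execute Pipeline1",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {
            "referenceName": "ChildPipeline",
            "type": "PipelineReference"
        },
        "waitOnCompletion": true,
        "parameters": {
            "FolderName": {
                "value": "@item().name",
                "type": "Expression"
            }
        }
    }
}
```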

Child Pipeline:

  • In the child pipeline, another Get Metadata activity is taken. In the dataset file path, the container name is given; for the folder, a dataset parameter is created and the value of the pipeline parameter FolderName is passed to it: @pipeline().parameters.FolderName

  • Child items is selected as an argument in the field list. This activity will give the list of files that are available in the directory.

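A rough sketch of this inner Get Metadata activity, assuming the folder-level dataset is named DS_Folder and its dataset parameter is folderName:

```json
{
    "name": "Get_Metadata_inner",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "DS_Folder",
            "type": "DatasetReference",
            "parameters": {
                "folderName": {
                    "value": "@pipeline().parameters.FolderName",
                    "type": "Expression"
                }
            }
        },
        "fieldList": [ "childItems" ]
    }
}
```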

  • Then a ForEach activity is added, and the output of the Get Metadata activity is given in Items: @activity('Get_Metadata_inner').output.childItems

  • Inside the ForEach, the dataflow activity is added.

Dataflow

  • In the dataflow, a parameter called filename is created.
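In the data flow script, the parameter declaration appears roughly as below; the default value is arbitrary and only a placeholder.

```
parameters{
    filename as string ('default.csv')
}
```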

  • In the source dataset, dataset parameters are created for the file name and folder name, named fileName and folderName respectively.

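A minimal sketch of such a parameterized source dataset, assuming a delimited-text dataset on ADLS Gen2 (names such as DS_DataflowSource, LS_DataLake and the container are placeholders):

```json
{
    "name": "DS_DataflowSource",
    "properties": {
        "linkedServiceName": {
            "referenceName": "LS_DataLake",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "folderName": { "type": "string" },
            "fileName": { "type": "string" }
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "sourcecontainer",
                "folderPath": {
                    "value": "@dataset().folderName",
                    "type": "Expression"
                },
                "fileName": {
                    "value": "@dataset().fileName",
                    "type": "Expression"
                }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```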

  • Then all the other transformations are added in the data flow.

  • In the sink dataset of the sink transformation, a dataset parameter for the folder is created, and the file name is left blank in the dataset.

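A similar sketch of the sink dataset, assuming a delimited-text dataset on blob storage; only the folder is parameterized and no file name is set (names are again placeholders):

```json
{
    "name": "DS_DataflowSink",
    "properties": {
        "linkedServiceName": {
            "referenceName": "LS_BlobStorage",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "folderName": { "type": "string" }
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "sinkcontainer",
                "folderPath": {
                    "value": "@dataset().folderName",
                    "type": "Expression"
                }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```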

  • The file name is given in the sink settings. Its value is the dataflow parameter $filename.


  • In the child pipeline, the dataflow activity settings are given as follows. fileName: @item().name; folderName (for both the source and sink dataset parameters): @pipeline().parameters.FolderName


  • In the Parameters tab, the filename value is given as @item().name
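Putting the last two steps together, here is a rough sketch of the dataflow activity in the child pipeline. The dataflow name IncrementalLoadDataflow and the transformation names source1/sink1 are placeholders, the pipeline value for the string parameter filename is wrapped in single quotes so the data flow receives it as a string literal, and the exact JSON the UI generates may differ slightly.

```json
{
    "name": "Data flow1",
    "type": "ExecuteDataFlow",
    "typeProperties": {
        "dataflow": {
            "referenceName": "IncrementalLoadDataflow",
            "type": "DataFlowReference",
            "parameters": {
                "filename": {
                    "value": "'@{item().name}'",
                    "type": "Expression"
                }
            },
            "datasetParameters": {
                "source1": {
                    "fileName": { "value": "@item().name", "type": "Expression" },
                    "folderName": { "value": "@pipeline().parameters.FolderName", "type": "Expression" }
                },
                "sink1": {
                    "folderName": { "value": "@pipeline().parameters.FolderName", "type": "Expression" }
                }
            }
        },
        "compute": {
            "coreCount": 8,
            "computeType": "General"
        }
    }
}
```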

  • In this repro, a simple select transformation is used; this can be extended to any transformation in the data flow. In this way, we can pass the values to the dataflow.
