How to get modified date as column in table while ingesting all files from year/month/date directories of storage account?

I have some JSON files in an ADLS account. The files are organized in a Year/Month/Day directory structure. I want to copy all of them from ADLS to Azure SQL DB using an Azure data flow.
I am able to ingest the data using a data flow, but I also want to include the file path, the file ingestion (last modified) date, and the file name in three separate columns, and I do not know how to get these values.

Please note that each Day directory has more than one file, as follows:

container_name/Dataset/Year/Month/Day/file1.json, file2.json, file3.json

Could anyone help me ingest the modified date as a column in the table, alongside the data of each file?

I tried using Get Metadata to copy each file one by one, and I also tried a derived column in the data flow to get the modified date.

I reproduced the above and was able to get the desired file by using a combination of the Additional columns option in the copy activity, a Lookup activity, and a Get Metadata activity.

These are the datasets, with dataset parameters, that I used in the various activities.

Source_files_wild_card_path:

[screenshot]

temporary_filepaths:

[screenshot]

Each_file:

[screenshot]

intermediate:

[screenshot]

target_folder:

[screenshot]
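
Because the dataset screenshots are not reproduced here, the following is only a rough sketch of what a parameterized dataset such as Each_file might look like in JSON. The filename parameter comes from the pipeline JSON further down; everything in angle brackets (linked service, container, folder) is a placeholder assumption and not the configuration from the original screenshots.

```json
{
    "name": "Each_file",
    "properties": {
        "linkedServiceName": {
            "referenceName": "<your-adls-linked-service>",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "filename": {
                "type": "string"
            }
        },
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileName": {
                    "value": "@dataset().filename",
                    "type": "Expression"
                },
                "folderPath": "<source-folder>",
                "fileSystem": "<container>"
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```

The intermediate dataset would presumably follow the same pattern with a file_name parameter, since that is the parameter name the pipeline passes to it.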

As far as I know, in ADF the last modified date of a file can be obtained either through the REST APIs or with the Get Metadata activity. However, Get Metadata on its own won't work with dynamic file paths across a folder structure like yours.

Also, the file path of a blob can be obtained only from trigger parameters or from the Additional columns option of the copy activity. Since no triggers are used here, I went with the second method.

  • First, I used a copy activity with a wildcard path covering all source files, added $$FILEPATH as an additional column, and copied the result to a temporary file temp1.csv with Merge files as the copy behavior.

  • Then I used a Lookup activity on temp1.csv to read the file as an array of objects, from which the list of file paths can be extracted.

  • Next, I created two variables of type Array (path_list and unique_path_list).

    [screenshot]

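  • Because the Lookup output is an array of objects, use a ForEach loop to extract just the file paths: append @item().filepath to the path_list array.

  • Then use the expression below to store the unique list of file paths in the unique_path_list array. Every row of temp1.csv carries the path of its source file, so path_list contains duplicates; taking the union of the array with itself removes them.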

    @union(variables('path_list'),variables('path_list'))

  • Now iterate over this array in a second ForEach. Inside it, use a Get Metadata activity with the Each_file dataset, pass @item() as the filename parameter, and add Item name and Last modified to the Field list.

  • Then add a copy activity inside the ForEach, using the same dataset as the source. Add the additional columns for the file name, file path, and last modified date here, and set their values from $$FILEPATH and the Get Metadata output.

  • For the sink of this copy activity, use another temporary (staging) folder through the intermediate dataset, and generate a unique file name with a date function.

  • After the ForEach, use another copy activity with the intermediate dataset as the source (use the wildcard path *.csv and pass any value, such as an empty string, for the dataset parameter) and the target_folder dataset as the sink, with Merge files as the copy behavior, to produce the result file.

My pipeline JSON:

{
"name": "last_modifed_pipeline_copy1",
"properties": {
    "activities": [
        {
            "name": "for_paths_columns",
            "type": "Copy",
            "dependsOn": [],
            "policy": {
                "timeout": "0.12:00:00",
                "retry": 0,
                "retryIntervalInSeconds": 30,
                "secureOutput": false,
                "secureInput": false
            },
            "userProperties": [],
            "typeProperties": {
                "source": {
                    "type": "DelimitedTextSource",
                    "additionalColumns": [
                        {
                            "name": "filepath",
                            "value": "$$FILEPATH"
                        }
                    ],
                    "storeSettings": {
                        "type": "AzureBlobFSReadSettings",
                        "recursive": true,
                        "wildcardFolderPath": "*/*/*",
                        "wildcardFileName": "*.csv",
                        "enablePartitionDiscovery": false
                    },
                    "formatSettings": {
                        "type": "DelimitedTextReadSettings"
                    }
                },
                "sink": {
                    "type": "DelimitedTextSink",
                    "storeSettings": {
                        "type": "AzureBlobFSWriteSettings",
                        "copyBehavior": "MergeFiles"
                    },
                    "formatSettings": {
                        "type": "DelimitedTextWriteSettings",
                        "quoteAllText": true,
                        "fileExtension": ".txt"
                    }
                },
                "enableStaging": false,
                "translator": {
                    "type": "TabularTranslator",
                    "typeConversion": true,
                    "typeConversionSettings": {
                        "allowDataTruncation": true,
                        "treatBooleanAsNumber": false
                    }
                }
            },
            "inputs": [
                {
                    "referenceName": "Source_files_wild_card_path",
                    "type": "DatasetReference"
                }
            ],
            "outputs": [
                {
                    "referenceName": "temporary_filepaths",
                    "type": "DatasetReference"
                }
            ]
        },
        {
            "name": "Lookup1",
            "type": "Lookup",
            "dependsOn": [
                {
                    "activity": "for_paths_columns",
                    "dependencyConditions": [
                        "Succeeded"
                    ]
                }
            ],
            "policy": {
                "timeout": "0.12:00:00",
                "retry": 0,
                "retryIntervalInSeconds": 30,
                "secureOutput": false,
                "secureInput": false
            },
            "userProperties": [],
            "typeProperties": {
                "source": {
                    "type": "DelimitedTextSource",
                    "storeSettings": {
                        "type": "AzureBlobFSReadSettings",
                        "recursive": true,
                        "enablePartitionDiscovery": false
                    },
                    "formatSettings": {
                        "type": "DelimitedTextReadSettings"
                    }
                },
                "dataset": {
                    "referenceName": "temporary_filepaths",
                    "type": "DatasetReference"
                },
                "firstRowOnly": false
            }
        },
        {
            "name": "append filepaths array",
            "type": "ForEach",
            "dependsOn": [
                {
                    "activity": "Lookup1",
                    "dependencyConditions": [
                        "Succeeded"
                    ]
                }
            ],
            "userProperties": [],
            "typeProperties": {
                "items": {
                    "value": "@activity('Lookup1').output.value",
                    "type": "Expression"
                },
                "isSequential": true,
                "activities": [
                    {
                        "name": "Append variable1",
                        "type": "AppendVariable",
                        "dependsOn": [],
                        "userProperties": [],
                        "typeProperties": {
                            "variableName": "path_list",
                            "value": {
                                "value": "@item().filepath",
                                "type": "Expression"
                            }
                        }
                    }
                ]
            }
        },
        {
            "name": "get_unique_paths array",
            "type": "SetVariable",
            "dependsOn": [
                {
                    "activity": "append filepaths array",
                    "dependencyConditions": [
                        "Succeeded"
                    ]
                }
            ],
            "userProperties": [],
            "typeProperties": {
                "variableName": "unique_path_list",
                "value": {
                    "value": "@union(variables('path_list'),variables('path_list'))",
                    "type": "Expression"
                }
            }
        },
        {
            "name": "adds_last modifed column",
            "type": "ForEach",
            "dependsOn": [
                {
                    "activity": "get_unique_paths array",
                    "dependencyConditions": [
                        "Succeeded"
                    ]
                }
            ],
            "userProperties": [],
            "typeProperties": {
                "items": {
                    "value": "@variables('unique_path_list')",
                    "type": "Expression"
                },
                "isSequential": true,
                "activities": [
                    {
                        "name": "Get Metadata1",
                        "type": "GetMetadata",
                        "dependsOn": [],
                        "policy": {
                            "timeout": "0.12:00:00",
                            "retry": 0,
                            "retryIntervalInSeconds": 30,
                            "secureOutput": false,
                            "secureInput": false
                        },
                        "userProperties": [],
                        "typeProperties": {
                            "dataset": {
                                "referenceName": "Each_file",
                                "type": "DatasetReference",
                                "parameters": {
                                    "filename": {
                                        "value": "@item()",
                                        "type": "Expression"
                                    }
                                }
                            },
                            "fieldList": [
                                "itemName",
                                "lastModified"
                            ],
                            "storeSettings": {
                                "type": "AzureBlobFSReadSettings",
                                "enablePartitionDiscovery": false
                            },
                            "formatSettings": {
                                "type": "DelimitedTextReadSettings"
                            }
                        }
                    },
                    {
                        "name": "Copy data2",
                        "type": "Copy",
                        "dependsOn": [
                            {
                                "activity": "Get Metadata1",
                                "dependencyConditions": [
                                    "Succeeded"
                                ]
                            }
                        ],
                        "policy": {
                            "timeout": "0.12:00:00",
                            "retry": 0,
                            "retryIntervalInSeconds": 30,
                            "secureOutput": false,
                            "secureInput": false
                        },
                        "userProperties": [],
                        "typeProperties": {
                            "source": {
                                "type": "DelimitedTextSource",
                                "additionalColumns": [
                                    {
                                        "name": "file_path",
                                        "value": "$$FILEPATH"
                                    },
                                    {
                                        "name": "file_name",
                                        "value": {
                                            "value": "@activity('Get Metadata1').output.itemName",
                                            "type": "Expression"
                                        }
                                    },
                                    {
                                        "name": "last_modifed",
                                        "value": {
                                            "value": "@activity('Get Metadata1').output.lastModified",
                                            "type": "Expression"
                                        }
                                    }
                                ],
                                "storeSettings": {
                                    "type": "AzureBlobFSReadSettings",
                                    "recursive": true,
                                    "enablePartitionDiscovery": false
                                },
                                "formatSettings": {
                                    "type": "DelimitedTextReadSettings"
                                }
                            },
                            "sink": {
                                "type": "DelimitedTextSink",
                                "storeSettings": {
                                    "type": "AzureBlobFSWriteSettings"
                                },
                                "formatSettings": {
                                    "type": "DelimitedTextWriteSettings",
                                    "quoteAllText": true,
                                    "fileExtension": ".txt"
                                }
                            },
                            "enableStaging": false,
                            "translator": {
                                "type": "TabularTranslator",
                                "typeConversion": true,
                                "typeConversionSettings": {
                                    "allowDataTruncation": true,
                                    "treatBooleanAsNumber": false
                                }
                            }
                        },
                        "inputs": [
                            {
                                "referenceName": "Each_file",
                                "type": "DatasetReference",
                                "parameters": {
                                    "filename": {
                                        "value": "@item()",
                                        "type": "Expression"
                                    }
                                }
                            }
                        ],
                        "outputs": [
                            {
                                "referenceName": "intermediate",
                                "type": "DatasetReference",
                                "parameters": {
                                    "file_name": {
                                        "value": "@concat(utcNow(),'.csv')",
                                        "type": "Expression"
                                    }
                                }
                            }
                        ]
                    }
                ]
            }
        },
        {
            "name": "Copy data3",
            "type": "Copy",
            "dependsOn": [
                {
                    "activity": "adds_last modifed column",
                    "dependencyConditions": [
                        "Succeeded"
                    ]
                }
            ],
            "policy": {
                "timeout": "0.12:00:00",
                "retry": 0,
                "retryIntervalInSeconds": 30,
                "secureOutput": false,
                "secureInput": false
            },
            "userProperties": [],
            "typeProperties": {
                "source": {
                    "type": "DelimitedTextSource",
                    "storeSettings": {
                        "type": "AzureBlobFSReadSettings",
                        "recursive": true,
                        "wildcardFileName": "*.csv",
                        "enablePartitionDiscovery": false
                    },
                    "formatSettings": {
                        "type": "DelimitedTextReadSettings"
                    }
                },
                "sink": {
                    "type": "DelimitedTextSink",
                    "storeSettings": {
                        "type": "AzureBlobFSWriteSettings",
                        "copyBehavior": "MergeFiles"
                    },
                    "formatSettings": {
                        "type": "DelimitedTextWriteSettings",
                        "quoteAllText": true,
                        "fileExtension": ".txt"
                    }
                },
                "enableStaging": false,
                "translator": {
                    "type": "TabularTranslator",
                    "typeConversion": true,
                    "typeConversionSettings": {
                        "allowDataTruncation": true,
                        "treatBooleanAsNumber": false
                    }
                }
            },
            "inputs": [
                {
                    "referenceName": "intermediate",
                    "type": "DatasetReference",
                    "parameters": {
                        "file_name": "No value"
                    }
                }
            ],
            "outputs": [
                {
                    "referenceName": "target_folder",
                    "type": "DatasetReference"
                }
            ]
        }
    ],
    "variables": {
        "path_list": {
            "type": "Array"
        },
        "unique_path_list": {
            "type": "Array"
        }
    },
    "annotations": [],
    "lastPublishTime": "2023-01-27T12:40:51Z"
},
"type": "Microsoft.DataFactory/factories/pipelines"
}

My pipeline:

[screenshot]

Result file:

[screenshot]

NOTE:

If you want to run this on a regular basis, use a Storage event trigger, which exposes trigger parameters such as @triggerBody().folderPath and @triggerBody().fileName. You can pass these to the Get Metadata activity to get the last modified time, and then pass that value to a copy activity or data flow to add it as an additional column, as your requirement dictates.
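
As a rough illustration of that note, a storage event trigger definition could map the two trigger values onto pipeline parameters as sketched below. The trigger name, the pipeline parameter names folder_path and file_name, and every value in angle brackets are hypothetical placeholders, not part of the pipeline shown above.

```json
{
    "name": "on_new_file",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/<container>/blobs/<source-folder>/",
            "blobPathEndsWith": ".json",
            "ignoreEmptyBlobs": true,
            "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>",
            "events": [
                "Microsoft.Storage.BlobCreated"
            ]
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "<your-pipeline>",
                    "type": "PipelineReference"
                },
                "parameters": {
                    "folder_path": "@triggerBody().folderPath",
                    "file_name": "@triggerBody().fileName"
                }
            }
        ]
    }
}
```

Inside the pipeline, these values would then be read as @pipeline().parameters.folder_path and @pipeline().parameters.file_name and passed to the dataset parameters of the Get Metadata activity, so the ForEach over all paths is no longer needed for newly arriving files.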

