How to get modified date as column in table while ingesting all files from year/month/date directories of storage account?

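I have some JSON files in an ADLS account. The files are ingested into a multi-level year/month/day directory structure. I want to copy all of the files from ADLS to Azure SQL DB using an Azure data flow.
I am able to ingest the data with the data flow, but I also want to include the file path, the file ingestion date, and the file name in three separate columns, and I don't know how to get these values.

Note that each Day directory contains multiple files, like this:

container_name/Dataset/Year/Month/Day/file1.json, file2.json, file3.json

Can anyone help me get the modified date as a column in the table alongside the data ingested from each file?

I tried using Get Metadata to copy each file one by one, with a derived column in the data flow for the modified date.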

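I reproduced the above and was able to get the required file by combining the additional columns option of the copy activity with the Lookup and Get Metadata activities.

These are my datasets; I used dataset parameters in the various activities.

Source_files_wild_card_path:

[screenshot]

temporary_filepaths:

[screenshot]

Each_file:

[screenshot]

intermediate:

[screenshot]

target_folder:

[screenshot]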

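AFAIK, in ADF we can get the last modified date of a file either through the REST API or through the Get Metadata activity. However, Get Metadata does not work for dynamic file paths with a folder structure like yours.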
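
For the REST API route, here is a minimal Python sketch, assuming the ADLS Gen2 "Path - Get Properties" operation (a HEAD request against the dfs endpoint) and an OAuth bearer token obtained elsewhere. The function names, the account, filesystem, and token values, and the pinned x-ms-version are illustrative assumptions, not part of the original answer.

```python
import urllib.parse
import urllib.request
from email.utils import parsedate_to_datetime


def build_get_properties_request(account, filesystem, path, bearer_token):
    """Build a HEAD request for one file on the ADLS Gen2 dfs endpoint."""
    quoted_path = urllib.parse.quote(path)  # keeps "/" separators intact
    url = f"https://{account}.dfs.core.windows.net/{filesystem}/{quoted_path}"
    return urllib.request.Request(
        url,
        method="HEAD",
        headers={
            "Authorization": f"Bearer {bearer_token}",
            "x-ms-version": "2021-08-06",  # assumed service version
        },
    )


def parse_last_modified(header_value):
    """Convert an HTTP Last-Modified header value into a datetime."""
    return parsedate_to_datetime(header_value)


def fetch_last_modified(account, filesystem, path, bearer_token):
    """Send the request and return the file's last modified time."""
    request = build_get_properties_request(account, filesystem, path, bearer_token)
    with urllib.request.urlopen(request) as response:
        return parse_last_modified(response.headers["Last-Modified"])
```

Calling fetch_last_modified needs network access and a valid token; the other two helpers can be exercised offline.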

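Also, we can get the file path of a blob file either from a trigger or from the additional columns option of the copy activity. Since no trigger is used here, I went with the second approach.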

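  • So first, I used a copy activity with a wildcard path covering all of the source files, added $$FILEPATH as an additional column, and copied everything into a temporary file, temp1.csv, with Merge files as the copy behavior.

  • Then I used a Lookup activity on temp1.csv to read the file as an array of objects, from which I can get the list of file paths.

  • Here I created two variables of type Array, path_list and unique_path_list.

    [screenshot]

  • Because the Lookup output is an array of objects, use a ForEach loop with an Append variable activity to append @item().filepath to the path_list array, so that it holds only the file paths.

  • The merged file has one row per source row, so each path appears many times. Use the expression below to get the unique list of file paths in the unique_path_list array.

    @union(variables('path_list'),variables('path_list'))

  • Now pass this array to a second ForEach. Inside that ForEach, use a Get Metadata activity with the Each_file dataset and @item() as the file name, and add Item name and Last modified to the Field list.

  • Then use a copy activity inside the ForEach with the same dataset. Add additional columns here for the file name, file path, and last modified time, and set them from $$FILEPATH and the Get Metadata output.

  • In the sink of this copy activity, use another temporary folder for staging (the intermediate dataset). Use a date function such as utcNow() to give each output file a distinct name.

  • After the ForEach, use another copy activity with the intermediate dataset as the source (use the wildcard path *.csv and give the dataset parameter any dummy string) and the target_folder dataset as the sink, with Merge files as the copy behavior, to get the result file.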
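
The core logic of the steps above can be sketched outside ADF in a few lines of Python. This is only an illustration of the idea, not the pipeline itself: the function names and sample paths are hypothetical, and keeping first-seen order when removing duplicates is a choice made for the sketch.

```python
import posixpath
from datetime import datetime, timezone


def unique_paths(path_list):
    """Drop duplicate paths, keeping first-seen order (the role of union(x, x))."""
    return list(dict.fromkeys(path_list))


def add_file_columns(rows, file_path, last_modified):
    """Return copies of rows with file_path, file_name and last_modified added."""
    file_name = posixpath.basename(file_path)
    stamp = last_modified.isoformat()
    return [
        {**row, "file_path": file_path, "file_name": file_name, "last_modified": stamp}
        for row in rows
    ]


# Hypothetical Lookup result: one path per data row, so paths repeat.
path_list = [
    "Dataset/2023/01/26/file1.csv",
    "Dataset/2023/01/26/file1.csv",
    "Dataset/2023/01/27/file2.csv",
]
unique_path_list = unique_paths(path_list)

# Stand-in for one ForEach iteration: Get Metadata supplies last_modified,
# and the copy activity appends the three extra columns to every row.
enriched = add_file_columns(
    [{"id": "1"}],
    unique_path_list[0],
    datetime(2023, 1, 26, 8, 0, tzinfo=timezone.utc),
)
```

Each element of enriched corresponds to one row written to the intermediate folder before the final merge.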

My pipeline JSON:

{
"name": "last_modifed_pipeline_copy1",
"properties": {
    "activities": [
        {
            "name": "for_paths_columns",
            "type": "Copy",
            "dependsOn": [],
            "policy": {
                "timeout": "0.12:00:00",
                "retry": 0,
                "retryIntervalInSeconds": 30,
                "secureOutput": false,
                "secureInput": false
            },
            "userProperties": [],
            "typeProperties": {
                "source": {
                    "type": "DelimitedTextSource",
                    "additionalColumns": [
                        {
                            "name": "filepath",
                            "value": "$$FILEPATH"
                        }
                    ],
                    "storeSettings": {
                        "type": "AzureBlobFSReadSettings",
                        "recursive": true,
                        "wildcardFolderPath": "*/*/*",
                        "wildcardFileName": "*.csv",
                        "enablePartitionDiscovery": false
                    },
                    "formatSettings": {
                        "type": "DelimitedTextReadSettings"
                    }
                },
                "sink": {
                    "type": "DelimitedTextSink",
                    "storeSettings": {
                        "type": "AzureBlobFSWriteSettings",
                        "copyBehavior": "MergeFiles"
                    },
                    "formatSettings": {
                        "type": "DelimitedTextWriteSettings",
                        "quoteAllText": true,
                        "fileExtension": ".txt"
                    }
                },
                "enableStaging": false,
                "translator": {
                    "type": "TabularTranslator",
                    "typeConversion": true,
                    "typeConversionSettings": {
                        "allowDataTruncation": true,
                        "treatBooleanAsNumber": false
                    }
                }
            },
            "inputs": [
                {
                    "referenceName": "Source_files_wild_card_path",
                    "type": "DatasetReference"
                }
            ],
            "outputs": [
                {
                    "referenceName": "temporary_filepaths",
                    "type": "DatasetReference"
                }
            ]
        },
        {
            "name": "Lookup1",
            "type": "Lookup",
            "dependsOn": [
                {
                    "activity": "for_paths_columns",
                    "dependencyConditions": [
                        "Succeeded"
                    ]
                }
            ],
            "policy": {
                "timeout": "0.12:00:00",
                "retry": 0,
                "retryIntervalInSeconds": 30,
                "secureOutput": false,
                "secureInput": false
            },
            "userProperties": [],
            "typeProperties": {
                "source": {
                    "type": "DelimitedTextSource",
                    "storeSettings": {
                        "type": "AzureBlobFSReadSettings",
                        "recursive": true,
                        "enablePartitionDiscovery": false
                    },
                    "formatSettings": {
                        "type": "DelimitedTextReadSettings"
                    }
                },
                "dataset": {
                    "referenceName": "temporary_filepaths",
                    "type": "DatasetReference"
                },
                "firstRowOnly": false
            }
        },
        {
            "name": "append filepaths array",
            "type": "ForEach",
            "dependsOn": [
                {
                    "activity": "Lookup1",
                    "dependencyConditions": [
                        "Succeeded"
                    ]
                }
            ],
            "userProperties": [],
            "typeProperties": {
                "items": {
                    "value": "@activity('Lookup1').output.value",
                    "type": "Expression"
                },
                "isSequential": true,
                "activities": [
                    {
                        "name": "Append variable1",
                        "type": "AppendVariable",
                        "dependsOn": [],
                        "userProperties": [],
                        "typeProperties": {
                            "variableName": "path_list",
                            "value": {
                                "value": "@item().filepath",
                                "type": "Expression"
                            }
                        }
                    }
                ]
            }
        },
        {
            "name": "get_unique_paths array",
            "type": "SetVariable",
            "dependsOn": [
                {
                    "activity": "append filepaths array",
                    "dependencyConditions": [
                        "Succeeded"
                    ]
                }
            ],
            "userProperties": [],
            "typeProperties": {
                "variableName": "unique_path_list",
                "value": {
                    "value": "@union(variables('path_list'),variables('path_list'))",
                    "type": "Expression"
                }
            }
        },
        {
            "name": "adds_last modifed column",
            "type": "ForEach",
            "dependsOn": [
                {
                    "activity": "get_unique_paths array",
                    "dependencyConditions": [
                        "Succeeded"
                    ]
                }
            ],
            "userProperties": [],
            "typeProperties": {
                "items": {
                    "value": "@variables('unique_path_list')",
                    "type": "Expression"
                },
                "isSequential": true,
                "activities": [
                    {
                        "name": "Get Metadata1",
                        "type": "GetMetadata",
                        "dependsOn": [],
                        "policy": {
                            "timeout": "0.12:00:00",
                            "retry": 0,
                            "retryIntervalInSeconds": 30,
                            "secureOutput": false,
                            "secureInput": false
                        },
                        "userProperties": [],
                        "typeProperties": {
                            "dataset": {
                                "referenceName": "Each_file",
                                "type": "DatasetReference",
                                "parameters": {
                                    "filename": {
                                        "value": "@item()",
                                        "type": "Expression"
                                    }
                                }
                            },
                            "fieldList": [
                                "itemName",
                                "lastModified"
                            ],
                            "storeSettings": {
                                "type": "AzureBlobFSReadSettings",
                                "enablePartitionDiscovery": false
                            },
                            "formatSettings": {
                                "type": "DelimitedTextReadSettings"
                            }
                        }
                    },
                    {
                        "name": "Copy data2",
                        "type": "Copy",
                        "dependsOn": [
                            {
                                "activity": "Get Metadata1",
                                "dependencyConditions": [
                                    "Succeeded"
                                ]
                            }
                        ],
                        "policy": {
                            "timeout": "0.12:00:00",
                            "retry": 0,
                            "retryIntervalInSeconds": 30,
                            "secureOutput": false,
                            "secureInput": false
                        },
                        "userProperties": [],
                        "typeProperties": {
                            "source": {
                                "type": "DelimitedTextSource",
                                "additionalColumns": [
                                    {
                                        "name": "file_path",
                                        "value": "$$FILEPATH"
                                    },
                                    {
                                        "name": "file_name",
                                        "value": {
                                            "value": "@activity('Get Metadata1').output.itemName",
                                            "type": "Expression"
                                        }
                                    },
                                    {
                                        "name": "last_modifed",
                                        "value": {
                                            "value": "@activity('Get Metadata1').output.lastModified",
                                            "type": "Expression"
                                        }
                                    }
                                ],
                                "storeSettings": {
                                    "type": "AzureBlobFSReadSettings",
                                    "recursive": true,
                                    "enablePartitionDiscovery": false
                                },
                                "formatSettings": {
                                    "type": "DelimitedTextReadSettings"
                                }
                            },
                            "sink": {
                                "type": "DelimitedTextSink",
                                "storeSettings": {
                                    "type": "AzureBlobFSWriteSettings"
                                },
                                "formatSettings": {
                                    "type": "DelimitedTextWriteSettings",
                                    "quoteAllText": true,
                                    "fileExtension": ".txt"
                                }
                            },
                            "enableStaging": false,
                            "translator": {
                                "type": "TabularTranslator",
                                "typeConversion": true,
                                "typeConversionSettings": {
                                    "allowDataTruncation": true,
                                    "treatBooleanAsNumber": false
                                }
                            }
                        },
                        "inputs": [
                            {
                                "referenceName": "Each_file",
                                "type": "DatasetReference",
                                "parameters": {
                                    "filename": {
                                        "value": "@item()",
                                        "type": "Expression"
                                    }
                                }
                            }
                        ],
                        "outputs": [
                            {
                                "referenceName": "intermediate",
                                "type": "DatasetReference",
                                "parameters": {
                                    "file_name": {
                                        "value": "@concat(utcNow(),'.csv')",
                                        "type": "Expression"
                                    }
                                }
                            }
                        ]
                    }
                ]
            }
        },
        {
            "name": "Copy data3",
            "type": "Copy",
            "dependsOn": [
                {
                    "activity": "adds_last modifed column",
                    "dependencyConditions": [
                        "Succeeded"
                    ]
                }
            ],
            "policy": {
                "timeout": "0.12:00:00",
                "retry": 0,
                "retryIntervalInSeconds": 30,
                "secureOutput": false,
                "secureInput": false
            },
            "userProperties": [],
            "typeProperties": {
                "source": {
                    "type": "DelimitedTextSource",
                    "storeSettings": {
                        "type": "AzureBlobFSReadSettings",
                        "recursive": true,
                        "wildcardFileName": "*.csv",
                        "enablePartitionDiscovery": false
                    },
                    "formatSettings": {
                        "type": "DelimitedTextReadSettings"
                    }
                },
                "sink": {
                    "type": "DelimitedTextSink",
                    "storeSettings": {
                        "type": "AzureBlobFSWriteSettings",
                        "copyBehavior": "MergeFiles"
                    },
                    "formatSettings": {
                        "type": "DelimitedTextWriteSettings",
                        "quoteAllText": true,
                        "fileExtension": ".txt"
                    }
                },
                "enableStaging": false,
                "translator": {
                    "type": "TabularTranslator",
                    "typeConversion": true,
                    "typeConversionSettings": {
                        "allowDataTruncation": true,
                        "treatBooleanAsNumber": false
                    }
                }
            },
            "inputs": [
                {
                    "referenceName": "intermediate",
                    "type": "DatasetReference",
                    "parameters": {
                        "file_name": "No value"
                    }
                }
            ],
            "outputs": [
                {
                    "referenceName": "target_folder",
                    "type": "DatasetReference"
                }
            ]
        }
    ],
    "variables": {
        "path_list": {
            "type": "Array"
        },
        "unique_path_list": {
            "type": "Array"
        }
    },
    "annotations": [],
    "lastPublishTime": "2023-01-27T12:40:51Z"
},
"type": "Microsoft.DataFactory/factories/pipelines"
}

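My pipeline:

[screenshot]

Result file:

[screenshot]

Note:

If you want to run this on a regular basis, use a Storage event trigger, which exposes trigger parameters such as @triggerBody().folderPath and @triggerBody().fileName. You can pass these to Get Metadata to get the last modified time, and then pass that to the copy activity or a data flow to add it as an additional column, as your requirements dictate.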
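
As a rough illustration of how the two trigger values fit together, the Python sketch below splits them into the pieces a parameterized dataset would typically need. It assumes that folderPath begins with the container name; the function name and the sample values are hypothetical.

```python
def split_trigger_path(folder_path, file_name):
    """Split trigger values into container, directory, and full file path.

    Assumes folder_path looks like "container/dir1/dir2" (container first).
    """
    container, _, directory = folder_path.strip("/").partition("/")
    file_path = f"{directory}/{file_name}" if directory else file_name
    return {"container": container, "directory": directory, "file_path": file_path}


# Hypothetical values standing in for @triggerBody().folderPath and
# @triggerBody().fileName on a single blob-created event.
parts = split_trigger_path("container_name/Dataset/2023/01/27", "file1.json")
```

With this layout, the year, month, and day can be read from the trailing segments of parts["directory"] if they are needed as columns.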
