
[英]table1.date1 = get prior 12 months data from date1 in table2.monthyear in Athena? how to get 12 months prior data in year month?
[英]How to get modified date as column in table while ingesting all files from year/month/date directories of storage account?
我在 ADLS 帐户中有一些 json 文件。 这些文件以多个年/月/日目录结构摄取。 我想使用 azure 数据流将所有文件从 ADLS 复制到 Azure SQL DB。
我能够使用数据流提取数据,但我想在三个单独的列中包含文件路径、文件提取日期和文件名,但我不知道如何获取这些值。
请注意,每个 Day 目录都有多个文件,如下所示:
container_name/Dataset/Year/Month/Day/file1.json.file2.json,file3.json
任何人都可以帮助我,我如何使用每个文件的数据提取表中的修改日期列
尝试使用 getmedata 在任何修改日期的数据流派生列中逐个复制每个文件
我已经复制了上面的内容,并且能够通过在复制活动、查找和获取元数据活动中使用附加列选项的组合来获取所需的文件。
这些是我的数据集,我在各种活动中使用了数据集参数。
Source_files_wild_path:
临时文件路径:
每个文件:
中间的:
目标文件夹:
AFAIK ,在 ADF 中,我们可以通过REST API 或获取元数据获取文件的最后修改日期。 但是获取元数据不适用于具有像您这样的文件夹结构的动态文件路径。
此外,我们可以从触发器或仅复制活动的附加列选项中获取 blob 文件的文件路径。 在这里,由于没有使用触发器,我使用了第二种方法。
因此,首先,我使用了所有源文件的带通配符路径的复制活动,并将$$FILEPATH
添加为列并复制到临时文件temp1.csv
中, Merge files
作为复制行为。
然后我对temp1.csv
使用了一个查找活动来获取文件作为对象数组,通过它我可以获得文件路径列表。
这里我创建了两个数组类型的变量。
由于查找 output 是一个数组对象,要仅获取filename
object 数组,请使用 for 循环和 append @item().filepath
到path_list
数组。
然后使用下面的表达式获取unique_path_list
数组中所有文件路径的唯一列表。
@union(variables('path_list'),variables('path_list'))
现在,在 ForEach 和Foreach内部使用此数组,使用带有each_file
数据集和@item()
作为文件名的Get Meta 数据活动,并添加filedsList ,如Item name
和Last modified
。
然后在 Foreach 中使用复制活动,并使用相同的数据集。 这里添加额外的列,如文件名、文件路径和上次修改时间,并给出这些值。
在此复制活动的接收器中,使用另一个临时文件夹和暂存(数据集中intermediate
)。 使用日期 function 给出随机文件名。
在 ForEach 之后,使用另一个以intermediate
数据集为源的复制活动(使用通配符路径*.csv
并为数据集参数提供任何空字符串)和target_folder
文件夹作为接收器,通过合并文件获取结果文件。
我的管道 JSON:
{
"name": "last_modifed_pipeline_copy1",
"properties": {
"activities": [
{
"name": "for_paths_columns",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"additionalColumns": [
{
"name": "filepath",
"value": "$$FILEPATH"
}
],
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"wildcardFolderPath": "*/*/*",
"wildcardFileName": "*.csv",
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings",
"copyBehavior": "MergeFiles"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": false
}
}
},
"inputs": [
{
"referenceName": "Source_files_wild_card_path",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "temporary_filepaths",
"type": "DatasetReference"
}
]
},
{
"name": "Lookup1",
"type": "Lookup",
"dependsOn": [
{
"activity": "for_paths_columns",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"dataset": {
"referenceName": "temporary_filepaths",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "append filepaths array",
"type": "ForEach",
"dependsOn": [
{
"activity": "Lookup1",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "@activity('Lookup1').output.value",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "Append variable1",
"type": "AppendVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "path_list",
"value": {
"value": "@item().filepath",
"type": "Expression"
}
}
}
]
}
},
{
"name": "get_unique_paths array",
"type": "SetVariable",
"dependsOn": [
{
"activity": "append filepaths array",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "unique_path_list",
"value": {
"value": "@union(variables('path_list'),variables('path_list'))",
"type": "Expression"
}
}
},
{
"name": "adds_last modifed column",
"type": "ForEach",
"dependsOn": [
{
"activity": "get_unique_paths array",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "@variables('unique_path_list')",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "Get Metadata1",
"type": "GetMetadata",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"dataset": {
"referenceName": "Each_file",
"type": "DatasetReference",
"parameters": {
"filename": {
"value": "@item()",
"type": "Expression"
}
}
},
"fieldList": [
"itemName",
"lastModified"
],
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
}
},
{
"name": "Copy data2",
"type": "Copy",
"dependsOn": [
{
"activity": "Get Metadata1",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"additionalColumns": [
{
"name": "file_path",
"value": "$$FILEPATH"
},
{
"name": "file_name",
"value": {
"value": "@activity('Get Metadata1').output.itemName",
"type": "Expression"
}
},
{
"name": "last_modifed",
"value": {
"value": "@activity('Get Metadata1').output.lastModified",
"type": "Expression"
}
}
],
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": false
}
}
},
"inputs": [
{
"referenceName": "Each_file",
"type": "DatasetReference",
"parameters": {
"filename": {
"value": "@item()",
"type": "Expression"
}
}
}
],
"outputs": [
{
"referenceName": "intermediate",
"type": "DatasetReference",
"parameters": {
"file_name": {
"value": "@concat(utcNow(),'.csv')",
"type": "Expression"
}
}
}
]
}
]
}
},
{
"name": "Copy data3",
"type": "Copy",
"dependsOn": [
{
"activity": "adds_last modifed column",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"wildcardFileName": "*.csv",
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings",
"copyBehavior": "MergeFiles"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": false
}
}
},
"inputs": [
{
"referenceName": "intermediate",
"type": "DatasetReference",
"parameters": {
"file_name": "No value"
}
}
],
"outputs": [
{
"referenceName": "target_folder",
"type": "DatasetReference"
}
]
}
],
"variables": {
"path_list": {
"type": "Array"
},
"unique_path_list": {
"type": "Array"
}
},
"annotations": [],
"lastPublishTime": "2023-01-27T12:40:51Z"
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
我的管道:
结果文件:
笔记:
如果您想定期运行它,请使用 Storage 事件触发器,您可以通过它使用触发器参数,如@triggerBody().folderPath
和@triggerBody().fileName
。 您可以将这些提供给获取元数据以获取上次修改时间,然后将其传递给复制活动或数据流以根据您的要求添加为附加列。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.