
How do I list all paths in Azure Data Lake Gen 2 filtered by last modified date in Azure Data Factory?

We have an Azure Data Lake Gen 2 account that contains hundreds of thousands of JSON messages, which arrive continuously. The files are stored in a folder structure, but not one based on load time. We now need to use Azure Data Factory to retrieve all JSON files that are new since our pipelines last ran. Since the Get Metadata activity doesn't retrieve files and folders recursively, I've been looking at other options. I know it's possible to use Azure Functions, but ideally we'd like a low/no-code solution.

I'm able to list all paths in a given container with the Azure Storage Services API, using either the Path option or the List Blobs option. Unfortunately, I can't find a way to filter the results by last-modified date. Since we receive thousands of new messages every day, we need to limit the API response to only those files that have arrived since the previous pipeline run.

Any suggestions on how to achieve this without an Azure Function would be greatly appreciated.
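To make the gap concrete, here is a minimal Python sketch of the client-side filtering that would otherwise be needed on top of a path listing. It assumes the listing is a JSON object with a "paths" array whose entries carry "name", an RFC 1123 "lastModified" string, and an "isDirectory" flag on directories. The sample data and the function name paths_modified_since are hypothetical, and no request to the storage service is made.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import List


def paths_modified_since(listing: dict, since: datetime) -> List[str]:
    """Return the names of files in a path listing modified after `since`."""
    newer = []
    for entry in listing.get("paths", []):
        # Assumption: directories are flagged with isDirectory ("true" or True).
        if str(entry.get("isDirectory", "false")).lower() == "true":
            continue
        # Assumption: lastModified is an RFC 1123 date such as
        # "Tue, 02 Mar 2021 10:15:00 GMT", which parses to a UTC-aware datetime.
        modified = parsedate_to_datetime(entry["lastModified"])
        if modified > since:
            newer.append(entry["name"])
    return newer


# Hypothetical sample listing, not real service output.
sample_listing = {
    "paths": [
        {"name": "orders", "isDirectory": "true",
         "lastModified": "Mon, 01 Mar 2021 08:00:00 GMT"},
        {"name": "orders/a.json",
         "lastModified": "Sun, 28 Feb 2021 23:59:59 GMT"},
        {"name": "orders/b.json",
         "lastModified": "Tue, 02 Mar 2021 10:15:00 GMT"},
    ]
}

# Timestamp of the previous pipeline run (hypothetical value).
last_run = datetime(2021, 3, 1, tzinfo=timezone.utc)
new_files = paths_modified_since(sample_listing, last_run)
print(new_files)
```

The drawback is the one described above: the whole listing still has to be fetched and scanned on every run, because the filter is applied only after the response arrives.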

You can also use the Get Metadata activity for recursive retrieval by following it with a ForEach activity.

Point a Get Metadata activity at the folder and add Child Items (childItems) to its field list to retrieve the names of the items inside the folder.

Use a ForEach activity to iterate over each of the returned files, and inside it place a second Get Metadata activity that points to a parameterized dataset. In that inner activity, add Last modified (lastModified) to the field list to get the last-modified datetime of each file. You can then compare that value with the time of the previous pipeline run to keep only the new files.
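The control flow of that pattern can be sketched in plain Python. This is a stand-in to show the logic, not Data Factory code: the dictionaries FOLDERS and LAST_MODIFIED and the helpers get_child_items, get_last_modified, and find_new_files are all hypothetical, playing the roles of the outer Get Metadata activity (childItems), the inner Get Metadata activity (lastModified), and the ForEach loop with a date comparison.

```python
from datetime import datetime, timezone
from typing import Dict, List

UTC = timezone.utc

# Hypothetical in-memory stand-in for the lake: folder path -> its child items,
# shaped like a childItems result (a name plus a type of "File" or "Folder").
FOLDERS: Dict[str, List[dict]] = {
    "raw": [{"name": "2021", "type": "Folder"},
            {"name": "root.json", "type": "File"}],
    "raw/2021": [{"name": "a.json", "type": "File"},
                 {"name": "b.json", "type": "File"}],
}

# Hypothetical last-modified timestamps per file path.
LAST_MODIFIED: Dict[str, datetime] = {
    "raw/root.json": datetime(2021, 3, 2, tzinfo=UTC),
    "raw/2021/a.json": datetime(2021, 2, 25, tzinfo=UTC),
    "raw/2021/b.json": datetime(2021, 3, 3, tzinfo=UTC),
}


def get_child_items(folder: str) -> List[dict]:
    """Plays the role of Get Metadata with childItems in the field list."""
    return FOLDERS[folder]


def get_last_modified(path: str) -> datetime:
    """Plays the role of Get Metadata with lastModified in the field list."""
    return LAST_MODIFIED[path]


def find_new_files(folder: str, since: datetime) -> List[str]:
    """Walk the folder tree and collect files modified after `since`."""
    new_files = []
    for item in get_child_items(folder):  # the ForEach loop
        path = f"{folder}/{item['name']}"
        if item["type"] == "Folder":
            # Descend into subfolders so the listing becomes recursive.
            new_files.extend(find_new_files(path, since))
        elif get_last_modified(path) > since:
            new_files.append(path)
    return new_files


# Timestamp of the previous pipeline run (hypothetical value).
last_run = datetime(2021, 3, 1, tzinfo=UTC)
result = find_new_files("raw", last_run)
print(result)
```

Note that this issues one metadata lookup per file, so with hundreds of thousands of files the equivalent pipeline would make a correspondingly large number of activity calls per run.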


