
Azure Data Factory Only Retrieve New Blob files from Blob Storage

I am currently copying blob files from an Azure Blob storage to an Azure SQL Database. The copy is scheduled to run every 15 minutes, but each time it runs it re-imports all of the blob files. I would like to configure it so that it only imports files that have newly arrived in the Blob storage. One thing to note is that the files do not have a date/time stamp. All files are present in a single blob container, and new files are added to that same container. Do you know how to configure this?

I'd preface this answer by saying that a change in your approach may be warranted...

Given what you've described, you're fairly limited on options. One approach is to have your scheduled job maintain knowledge of what it has already stored into the SQL db. You loop over all the items within the container and check whether each one has been processed yet.

The container has a ListBlobs method that would work for this. Reference: https://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-how-to-use-blobs/

using Microsoft.WindowsAzure.Storage.Blob;

// Flat listing: the second argument (useFlatBlobListing = true) returns every
// blob in the container regardless of virtual folder.
foreach (var item in container.ListBlobs(null, true))
{
    // Check if this blob has already been processed or not
}
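
To make that concrete, here is a minimal sketch of the idea, assuming a hypothetical ProcessedBlobs tracking table in the target SQL database (the table name, connection strings and container name are placeholders, not part of the original setup):

using System.Data.SqlClient;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

// Sketch: skip blobs whose names are already recorded in a hypothetical
// ProcessedBlobs tracking table, then record each newly imported one.
var account = CloudStorageAccount.Parse(storageConnectionString); // placeholder
var container = account.CreateCloudBlobClient().GetContainerReference("mycontainer");

using (var conn = new SqlConnection(sqlConnectionString)) // placeholder
{
    conn.Open();
    foreach (var item in container.ListBlobs(null, true))
    {
        var blob = item as CloudBlockBlob;
        if (blob == null) continue;

        // Already imported?
        var check = new SqlCommand(
            "SELECT COUNT(*) FROM ProcessedBlobs WHERE BlobName = @name", conn);
        check.Parameters.AddWithValue("@name", blob.Name);
        if ((int)check.ExecuteScalar() > 0) continue;

        // ... import the blob's contents into the destination table here ...

        // Remember that this blob has been processed.
        var mark = new SqlCommand(
            "INSERT INTO ProcessedBlobs (BlobName) VALUES (@name)", conn);
        mark.Parameters.AddWithValue("@name", blob.Name);
        mark.ExecuteNonQuery();
    }
}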

Note that the number of blobs in the container may be an issue with this approach. If it is too large, consider creating a new container per hour/day/week/etc. to hold the blobs, assuming you can control this.

Please use CloudBlobContainer.ListBlobs(null, true, BlobListingDetails.Metadata) and check CloudBlob.Properties.LastModified for each listed blob.
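
As a quick sketch of that suggestion (lastRunTime and GetLastRunTime are assumptions for illustration; you would need to persist the timestamp of the previous run yourself):

using System;
using Microsoft.WindowsAzure.Storage.Blob;

// Sketch: request blob metadata in the listing, then keep only blobs whose
// LastModified is newer than the previous run. GetLastRunTime() is a
// hypothetical helper that loads the persisted timestamp of the last run.
DateTimeOffset lastRunTime = GetLastRunTime();
foreach (var item in container.ListBlobs(null, true, BlobListingDetails.Metadata))
{
    var blob = item as CloudBlob;
    if (blob == null) continue;

    if (blob.Properties.LastModified.HasValue &&
        blob.Properties.LastModified.Value > lastRunTime)
    {
        // New (or changed) since the last run: import this blob.
    }
}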

Instead of a copy activity, I would use a custom DotNet activity within Azure Data Factory and use the Blob Storage API (some of the answers here have described the use of this API) and Azure SQL API to perform your copy of only the new files.
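
A bare-bones skeleton of such a custom activity might look like this (the class name is made up and the body only outlines the steps; IDotNetActivity is the extension point Data Factory v1 uses for custom .NET activities):

using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;

// Sketch only: a custom DotNet activity that copies just the new files.
public class CopyNewBlobsActivity : IDotNetActivity // hypothetical name
{
    public IDictionary<string, string> Execute(
        IEnumerable<LinkedService> linkedServices,
        IEnumerable<Dataset> datasets,
        Activity activity,
        IActivityLogger logger)
    {
        logger.Write("Copying only new blobs to Azure SQL...");

        // 1. List the blobs in the container (see the ListBlobs examples above).
        // 2. Skip every blob already recorded as processed.
        // 3. Insert the remaining files into the SQL table and mark them processed.

        return new Dictionary<string, string>();
    }
}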

However, with time, your blob location will have a lot of files, so expect that your job will start taking longer and longer (at some point, longer than 15 minutes) as it iterates through every file on each run.

Can you explain your scenario further? Is there a reason you want to add data to the SQL tables every 15 minutes? Can you increase that to copy data every hour? Also, how is this data getting into Blob Storage? Is another Azure service putting it there, or is it an external application? If it is another service, consider moving it straight into Azure SQL and cutting out the Blob Storage.

Another suggestion would be to create folders for the 15-minute intervals, like hhmm. So, for example, a sample folder would be called '0515'. You could even have a parent folder for the year, month and day. This way you can insert the data into these folders in Blob Storage. Data Factory is capable of reading date and time folders and identifying new files that come into the date/time folders.
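
As a sketch of how the producing side could name the files this way (container and localFileStream are placeholders, not part of the original suggestion):

using System;
using Microsoft.WindowsAzure.Storage.Blob;

// Sketch: round the current UTC time down to the 15-minute slice and use it
// as a year/month/day/hhmm folder path, e.g. "2016/05/12/0515/myfile.csv".
var now = DateTime.UtcNow;
var slice = new DateTime(now.Year, now.Month, now.Day, now.Hour,
                         (now.Minute / 15) * 15, 0, DateTimeKind.Utc);
string folderPath = slice.ToString("yyyy'/'MM'/'dd'/'HHmm");

CloudBlockBlob blob = container.GetBlockBlobReference(folderPath + "/myfile.csv");
blob.UploadFromStream(localFileStream); // placeholder stream with the file content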

I hope this helps! If you can provide some more information about your problem, I'd be happy to help you further.
