How schedule activity works in Azure Data Factory

I am trying to grasp the concept of Data Factory to understand how the schedule activity works, but I don't really understand much.

Assume I have a workflow as below:

  1. I have an agent (built as a Windows Service) running on the client's machine which is scheduled to extract data from a SAP source daily at 1 AM and then put it on Azure Blob storage. The agent only tries to extract yesterday's data. For example, the agent running at 1 AM today (9 April) extracts only the full data for 8 April. This agent is not related to Data Factory.

  2. Assume it takes around 30 minutes for the agent to get the daily data (8 April) and put it in blob storage; it may be more or less depending on how big the data is.

  3. I have a Data Factory pipeline (active forever from 2016-04-08T01:30:00Z) which uses the blob storage as the input dataset and one schedule activity to copy data from blob storage to a database.

The input dataset has its availability option set to a daily frequency:

"availability": {
  "frequency": "Day",
  "interval": 1
}

The schedule activity is scheduled at a daily frequency:

   "scheduler": {
      "frequency": "Day",
      "interval": 1
    }

So, based on this workflow, my questions are:

  1. After 1:30 AM, the agent finishes the data extraction from SAP and puts it into blob storage as the input dataset. How does Data Factory know that the data slice for 8 April is ready?

  2. What if the data is not ready after 1:30 AM and the activity is still running at that time?

If you have data in Azure Blob Storage that is appearing daily, you can try using date folders (e.g. .../yyyy/MM/dd/...). Data Factory can detect whether a particular date folder exists to determine whether the slice for that day is ready for processing. If Data Factory doesn't see the folder for that day, it will not execute the pipeline for that slice.
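As a rough sketch of how that date-folder convention can be expressed (the dataset name, linked service name, container path, and retry values below are placeholders I've made up, not something from the question), the input blob dataset in the classic Data Factory JSON could be partitioned by the slice date and marked as externally produced:

{
  "name": "SapDailyBlobInput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "sap-extracts/{Year}/{Month}/{Day}/",
      "format": { "type": "TextFormat", "columnDelimiter": "," },
      "partitionedBy": [
        { "name": "Year",  "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
        { "name": "Day",   "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } }
      ]
    },
    "external": true,
    "policy": {
      "externalData": {
        "retryInterval": "00:01:00",
        "retryTimeout": "00:10:00",
        "maximumRetry": 3
      }
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}

Setting "external": true tells Data Factory that the data is produced outside the factory, and the externalData policy governs the retry behaviour while Data Factory waits for the external slice to appear.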

I would also suggest including the extraction process as part of the Data Factory processing, so that if the extraction fails, the pipeline will not be executed further.

I hope this helps!

If I understand your particular scenario correctly, and you have access to modify the code of the Windows service, you can have your Windows service kick off the ADF pipeline when it is complete. I am doing something exactly like this, and I need to control when my pipeline begins. I have a local job pulling data from a few data sources and putting it into an Azure SQL DB. Once that is complete I need my pipeline to start, but there was no way for me to know exactly when my job was going to complete. So the final step of my local job is to kick off my ADF pipeline. I have a write-up on how to do it here - Starting an azure data factory pipeline from .net.

Hope this helps.

To my knowledge, Azure Data Factory does not currently support triggering a pipeline by creating or updating a blob.

For this workflow, the solution is to schedule the input dataset based on time. If you're confident that the data extraction will be complete by 1:30 AM, then you can schedule the job to run daily at 1:30 AM (or perhaps a little later, in case the extraction runs long). To do this, set your pipeline's start time to something like "2016-04-08T01:30:00Z" (in UTC). You should be able to author the input dataset in such a way that the job will fail if the data extraction is not yet complete, which would allow you to notice the failure and rerun it. The activity will start when you schedule it to and will complete as soon as possible. See this page for details on moving data between Azure Blob and Azure SQL. Your workflow would look very similar to the example at that link, only your frequency would be "Day".
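As a hedged sketch of that arrangement (the pipeline, dataset, and linked service names plus the timeout and retry values are placeholders for illustration), the pipeline definition in the classic Data Factory JSON could look roughly like this:

{
  "name": "CopySapBlobToSqlPipeline",
  "properties": {
    "description": "Copy the daily SAP extract from blob storage into Azure SQL",
    "activities": [
      {
        "name": "CopyBlobToSql",
        "type": "Copy",
        "inputs": [ { "name": "SapDailyBlobInput" } ],
        "outputs": [ { "name": "SapDailySqlOutput" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "SqlSink", "writeBatchSize": 10000, "writeBatchTimeout": "00:05:00" }
        },
        "scheduler": { "frequency": "Day", "interval": 1 },
        "policy": { "timeout": "01:00:00", "retry": 3, "concurrency": 1 }
      }
    ],
    "start": "2016-04-08T01:30:00Z",
    "end": "9999-09-09T00:00:00Z"
  }
}

The start time and scheduler mirror the values from the question, and the activity policy allows a slice to be retried a few times if the input isn't ready when it first runs.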

Depending on how the local data is stored, it may be worth looking into moving the data directly from your on-premises source, bypassing Azure Blob. This is supported using a Data Management Gateway, as documented here. Unfortunately, I'm not familiar with SAP, so I can't offer more information about that.
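For what it's worth, an on-premises source behind a Data Management Gateway is declared as a linked service that names the gateway. A minimal sketch, using an on-premises SQL Server source as a stand-in (since I can't speak to the SAP connector specifics) and placeholder names throughout:

{
  "name": "OnPremSourceLinkedService",
  "properties": {
    "type": "OnPremisesSqlServer",
    "typeProperties": {
      "connectionString": "Data Source=<server>;Initial Catalog=<database>;Integrated Security=False;User ID=<user>;Password=<password>;",
      "gatewayName": "MyDataManagementGateway"
    }
  }
}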
