How to make a DAG that needs to process the data from today?

I have a DAG that starts at 7:30pm every day. It needs to process the files located in the /data/yyyy-mm-dd/ directory, where yyyy-mm-dd is that same day.

If I use execution_date + timedelta(days=1) it works when the DAG is run by the scheduler. But this breaks when I use the backfill command (I have to give it 2019-01-01 to run for 2019-01-02).

Is there a better way to accomplish this?

Your question sounds a little confused about the execution_date for backfills. The backfill command asks you to specify the start and end dates of the range to run the DAG for. It then uses the schedule_interval to work out which runs would have been scheduled in that range and passes each of them its execution_date.

So, your schedule_interval probably looks like 30 19 * * *. As you know, a run is passed the start of its interval and is triggered at the close of that interval, so a run with a scheduled execution_date of 2019-01-01T19:30:00.000 will be triggered to start after 2019-01-02T19:30:00.000. It seems that at that time you want the job to pick up data that landed in /data/2019-01-02/, which is why you're adding a day to the execution_date and formatting it for the source.
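To make the timing concrete, here is a minimal sketch in plain Python (no Airflow required). It assumes a fixed one-day interval, and the helper names trigger_time and data_dir are hypothetical, used only for illustration.

```python
from datetime import datetime, timedelta

# Assumption: a fixed daily interval, matching the cron expression 30 19 * * *
SCHEDULE_INTERVAL = timedelta(days=1)


def trigger_time(execution_date):
    """Earliest moment the run can start: the end of its interval."""
    return execution_date + SCHEDULE_INTERVAL


def data_dir(execution_date):
    """Directory named after the day on which the run actually starts."""
    return "/data/{:%Y-%m-%d}/".format(trigger_time(execution_date))


execution_date = datetime(2019, 1, 1, 19, 30)
print(trigger_time(execution_date))  # 2019-01-02 19:30:00
print(data_dir(execution_date))      # /data/2019-01-02/
```

The run labelled 2019-01-01 therefore reads the 2019-01-02 directory, which is the one-day shift described above.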

If you're backfilling, it should behave the same way (rather than shifting time around). So given -s 2019-01-01 -e 2019-01-02, it's going to backfill the run that would have been triggered after 2019-01-02T19:30:00.000, with the execution date of 2019-01-01T19:30:00.000, isn't it?
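As a rough illustration of that range logic, the sketch below lists the daily 19:30 execution dates that fall between a start and an end date. It is a simplified model and not Airflow's actual implementation; backfill_execution_dates is a hypothetical helper.

```python
from datetime import datetime, timedelta


def backfill_execution_dates(start, end, hour=19, minute=30):
    """Return daily execution dates at hour:minute with start <= date <= end."""
    dates = []
    current = start.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if current < start:
        # The first candidate fell before the range, so move to the next day.
        current += timedelta(days=1)
    while current <= end:
        dates.append(current)
        current += timedelta(days=1)
    return dates


# Mirrors -s 2019-01-01 -e 2019-01-02 (both dates taken as midnight)
runs = backfill_execution_dates(datetime(2019, 1, 1), datetime(2019, 1, 2))
for execution_date in runs:
    source = "/data/{:%Y-%m-%d}/".format(execution_date + timedelta(days=1))
    print(execution_date.isoformat(), "->", source)
```

Under these assumptions the range contains a single run, with execution date 2019-01-01T19:30:00, and adding a day still points it at /data/2019-01-02/.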

As for other ways to do this:

  • You could move your runs to midnight and have them use the date in the execution_date. But a 4.5h delay is probably not what you had in mind.
  • You could see whether the data directories could be named differently, though I doubt that would be okay if there are other people or jobs relying on them.
  • Airflow also has a next_execution_date, which is basically going to give you the same result as adding a day to the execution_date. But you might like the formatted macro {{ next_ds }} for your needs.
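For the last option, the sketch below mimics what ds and next_ds would hold for a fixed daily schedule. It uses plain Python only: template_values is a hypothetical helper, and the str.replace call merely stands in for the Jinja rendering that Airflow performs on templated operator fields.

```python
from datetime import datetime, timedelta


def template_values(execution_date, interval=timedelta(days=1)):
    """Approximate the ds / next_ds values for a fixed-interval schedule."""
    next_execution_date = execution_date + interval
    return {
        "ds": execution_date.strftime("%Y-%m-%d"),
        "next_ds": next_execution_date.strftime("%Y-%m-%d"),
    }


# In a real DAG this string would be placed in a templated operator field.
SOURCE_DIR_TEMPLATE = "/data/{{ next_ds }}/"

values = template_values(datetime(2019, 1, 1, 19, 30))
rendered = SOURCE_DIR_TEMPLATE.replace("{{ next_ds }}", values["next_ds"])
print(rendered)  # /data/2019-01-02/
```

Because the value is derived from the execution date, a scheduled run and a backfilled run with the same execution date would resolve to the same directory.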
