
Undo/rollback the effects of a data processing pipeline

I have a workflow that I'll describe as follows:

[  Dump(query)  ] ---+
                     |
                     +---> [ Parquet(dump, schema) ] ---> [ Hive(parquet) ]
                     |
[ Schema(query) ] ---+

Where:

  • query is a query to an RDBMS
  • Dump dumps the result of query to a CSV file, dump
  • Schema runs query and XComs its schema, schema
  • Parquet reads the CSV dump and uses schema to create a Parquet file, parquet
  • Hive creates a Hive table based on the Parquet file, parquet

The reasons behind this somewhat convoluted workflow are constraints that cannot be solved and lie outside the scope of the question (but yes, ideally it would be much simpler than this).
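For reference, here is a minimal sketch of how this pipeline could be wired up as an Airflow DAG, assuming Airflow 2-style imports; the callables, task ids and XCom keys are hypothetical placeholders rather than the actual implementation:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def dump_query(**context):
    # Run the RDBMS query and write the result to the CSV file "dump".
    ...

def extract_schema(**context):
    # Run the query and push its schema to XCom for the Parquet step.
    context["ti"].xcom_push(key="schema", value={"some_column": "string"})

def build_parquet(**context):
    # Read the CSV dump plus the XCom'd schema and write the Parquet file.
    schema = context["ti"].xcom_pull(task_ids="schema", key="schema")
    ...

def create_hive_table(**context):
    # Create a Hive table on top of the Parquet file.
    ...

with DAG("dump_to_hive", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    dump = PythonOperator(task_id="dump", python_callable=dump_query)
    schema = PythonOperator(task_id="schema", python_callable=extract_schema)
    parquet = PythonOperator(task_id="parquet", python_callable=build_parquet)
    hive = PythonOperator(task_id="hive", python_callable=create_hive_table)

    [dump, schema] >> parquet >> hive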

My question is about rolling back the effects of a pipeline in case of failure.

These are the rollbacks that I would like to see happen in different conditions:

  • dump should always be deleted, regardless of the end result of the pipeline
  • parquet should be deleted if, for whatever reason, the Hive table creation fails

Representing this in a workflow, I'd probably put it down like this:

[  Dump(query)  ] ---+
                     |
                     +---> [ Parquet(dump, schema) ] ---> [ Hive(parquet) ]
                     |                |                          |
[ Schema(query) ] ---+                |                          |
                                      v                          v
                            [ DeleteParquetOutput ] --> [ DeleteDumpOutput ]

Where the transition from Parquet to DeleteParquetOutput is performed only if an error occurs, and the transitions going into DeleteDumpOutput happen regardless of any failure in their dependencies.
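In Airflow terms, the two cleanup tasks can be sketched with trigger rules. The snippet below reuses the task objects from the earlier sketch; delete_parquet_output and delete_dump_output are hypothetical cleanup callables, and the Parquet cleanup is wired downstream of Hive so that a Hive failure actually triggers it, which deviates slightly from the diagram:

from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

delete_parquet = PythonOperator(
    task_id="delete_parquet_output",
    python_callable=delete_parquet_output,  # hypothetical cleanup function
    trigger_rule=TriggerRule.ONE_FAILED,    # run only if an upstream task failed
)

delete_dump = PythonOperator(
    task_id="delete_dump_output",
    python_callable=delete_dump_output,     # hypothetical cleanup function
    trigger_rule=TriggerRule.ALL_DONE,      # run once upstream is done, success or not
)

hive >> delete_parquet                   # Hive failed: remove the Parquet file
[hive, delete_parquet] >> delete_dump    # in any case: remove the CSV dump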

This should solve it, but I believe that more complex pipelines could suffer greatly from the increased complexity introduced by this error handling logic.

Before moving on to more details, my question: could this be considered a good practice when it comes to dealing with errors in an Airflow pipeline? What could be a different (and possibly more sustainable) approach?

If you are further interested in how I would like to solve this, read on, otherwise feel free to answer and/or comment.


My take on error handling in a pipeline

Ideally, what I'd like to do would be:

  • define a rollback procedure for each stage where it's relevant
  • for each rollback procedure, define whether it should happen only in case of failure or in any case
  • when the pipeline completes, reverse the dependency relationships and, starting from the last successful tasks, traverse the reversed DAG and run the relevant rollback procedures (where applicable)
  • errors from rollback procedures should be logged, but should not prevent the rollback of the whole pipeline from completing
  • for the previous point to hold, each task should define a single effect whose rollback procedure can be described without referencing other tasks

Let's make a couple of examples with the given pipeline.

Scenario 1: Success

We reverse the DAG and fill each task with its mandatory rollback procedure (if any), getting this

                                         +---> [ Dump: UNDO ]
                                         |
[ Hive: None ] ---> [ Parquet: None ] ---+
^                                        |
|                                        +---> [ Schema: None ]
+--- Start here

Scenario 2: Failure occurs at Hive

                                                 +---> [ Dump: UNDO ]
                                                 |
[ Hive: None ] ---> [ Parquet: UNDO (error) ] ---+
                    ^                            |
                    |                            +---> [ Schema: None ]
                    +--- Start here
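In framework-agnostic pseudo-Python, the traversal described above could look roughly like this toy sketch; every name in it is hypothetical and the rollback procedures are just placeholders:

import logging

# task name -> (policy, rollback procedure); tasks with no entry have no rollback
rollbacks = {
    "dump": ("always", lambda: print("delete the CSV dump")),
    "parquet": ("on_failure", lambda: print("delete the Parquet file")),
}

# reversed dependency relationships: task -> its upstream tasks
reversed_dag = {
    "hive": ["parquet"],
    "parquet": ["dump", "schema"],
    "dump": [],
    "schema": [],
}

def rollback(start_tasks, pipeline_failed):
    # Walk the reversed DAG from the given tasks and run the applicable rollbacks,
    # logging (but otherwise ignoring) errors raised by the rollbacks themselves.
    seen = set()
    stack = list(start_tasks)
    while stack:
        task = stack.pop()
        if task in seen:
            continue
        seen.add(task)
        if task in rollbacks:
            policy, procedure = rollbacks[task]
            if policy == "always" or (policy == "on_failure" and pipeline_failed):
                try:
                    procedure()
                except Exception:
                    logging.exception("rollback of %s failed, continuing", task)
        stack.extend(reversed_dag.get(task, []))

# Scenario 1: everything succeeded, start from Hive; only the dump gets deleted
rollback(["hive"], pipeline_failed=False)

# Scenario 2: Hive failed, start from the last successful task; parquet and dump get deleted
rollback(["parquet"], pipeline_failed=True)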

Is there any way to represent something like this in Airflow? I would also be open to evaluating different workflow automation solutions, should they enable this kind of approach.

The BaseOperator class, from which all operators and sensors derive, supports the callbacks on_success_callback, on_retry_callback, and on_failure_callback; maybe these will help.
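For example, a cleanup routine could be attached to the Hive task along these lines (a minimal sketch: delete_parquet_file is a hypothetical callback, create_hive_table stands in for the real Hive logic, and the callback receives the task context when the task fails):

from airflow.operators.python import PythonOperator

def delete_parquet_file(context):
    # Runs when the task fails; remove the Parquet output left behind by the
    # failed run (actual deletion logic omitted).
    ti = context["ti"]
    print(f"cleaning up Parquet output for {ti.dag_id}.{ti.task_id}")

hive = PythonOperator(
    task_id="hive",
    python_callable=create_hive_table,        # as in the question's pipeline
    on_failure_callback=delete_parquet_file,
)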

Seems like a complicated way to handle errors. I think it's better to think of errors as simply stopping the current run of a DAG so that you can fix any issues and re-start it from where it left off. Sure you can clean up partially created files that were created by a particular task but I wouldn't wind back the entire pipeline just because of some downstream issue.

Take for example what we do where I work, admittedly it's using different technologies but the same kind of workflow I think:

  1. Extract deltas from a source database for a specific interval period and zip them onto the Airflow worker server
  2. Move this zipped file into an S3 location
  3. Copy the S3 file into a Snowflake data warehouse.

With our current setup, if someone accidentally changes the structure of the Snowflake table that we load S3 files into, the only task that will fail is the last one (step 3), since the table structure no longer matches the CSV structure. To fix this we simply need to revert the structure of the table back to what it was and re-run the task that failed. Airflow would then re-copy the file from S3 into Snowflake and succeed.

With the setup that you propose, what would happen? If the last task fails, it would roll back the entire pipeline and remove the CSV file from the S3 bucket; we would have to download the file from the source database again. It would be better if we simply re-ran the task that copies from S3 into Snowflake, saving the hassle of having to run the entire DAG.
