
Airflow: how to catch errors from an Operator outside of the operator

Maybe the question isn't phrased in the best way. Basically, what I want to do is build a DAG that iterates over a list of SQL files and uses the BigQueryOperator() to execute them.

However, some SQL files in the list will reference tables that do not exist in BQ, and I want to catch these kinds of errors: they should not be printed in the log and the task should not be marked as failed. Instead, the errors should be added to a dictionary and shown by a different task that runs at the end.

As you can see from the code below, I tried to catch the error from the BigQueryOperator with try/except, but this does not work. The task gets executed and runs the SQL files fine, but as soon as there is an error it immediately prints the error in the log and marks the task as failed; the try/except clause is completely ignored. Also, the last task print_errors() does not print anything, as the dictionary stays empty. So it looks to me as if I can't influence an Airflow Operator once it is called, because it ignores the Python logic wrapped around it.

My current code looks as follows:

Importing some libraries:

import airflow
from airflow import models
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime, timedelta

Get some variables + the list of SQL files (hardcoded for now, but they will be fetched from GCP storage later on):

BQ_PROJECT = models.Variable.get('bq_datahub_project_id').strip()
BQ_DATASET_PREFIX = models.Variable.get('datasetPrefix').strip()
AIRFLOW_BUCKET = models.Variable.get('airflow_bucket').strip()
CODE =  models.Variable.get('gcs_code').strip()
COMPOSER = '-'.join(models.Variable.get('airflow_bucket').strip().split('-')[2:-2])

DAG

with models.DAG(dag_id='run-sql-files',
                schedule_interval='0 */3 * * *',
                user_defined_macros={"COMPOSER": COMPOSER},
                default_args=default_dag_args,
                concurrency=2,
                max_active_runs=1,
                catchup=False) as dag:
    def print_errors():
        if other_errors:
            for error, files in other_errors.items():
                print("Warning: " + error + " for the following SQL files:")
                for file in files:
                    print(file)
    t0 = DummyOperator(
        task_id='Start'
    )
    t2 = DummyOperator(
        task_id='End',
        trigger_rule='all_done'
    )
    other_errors = {}
    for i, sql_file in enumerate(sql_files):
        try:
            full_path_sql = AIRFLOW_BUCKET + sql_file
            t1 = BigQueryOperator(
                task_id='sql_'+str(i),
                params={"datahubProject": BQ_PROJECT, "datasetPrefix": BQ_DATASET_PREFIX,},
                sql=sql_file,
                use_legacy_sql=False,
                location='europe-west3',
                labels={ "composer_id": COMPOSER, "dag_id": "{{ dag.dag_id }}", "task_id": "{{ task.task_id }}"},
                dag=dag
                )
            t0 >> t1 >> t2
        except Exception as e:
            other_errors[str(e)] = other_errors.get(str(e), []) + [sql_file]
    t3 = PythonOperator(
        task_id='print_errors',
        python_callable=print_errors,
        provide_context=True,
        dag=dag)
    t2 >> t3

The try/except in your DAG only wraps the instantiation of the BigQueryOperator, which happens when the DAG file is parsed; the query itself runs later, inside the operator's execute() method on the worker, so the exception is raised there and never reaches your try/except. To solve your issue, I propose using a PythonOperator with the BigQuery Python client, which gives you more flexibility and lets you catch the errors yourself:

import logging

from google.cloud import bigquery


def execute_queries():
   client = bigquery.Client()

   for sql_file in sql_files:

     # if needed, first read the SQL query string from the current sql file
     query = sql_file

     query_job = client.query(query)

     try:
       query_job.result()
     except Exception:
       for error in query_job.errors:
         # apply your logic
         logging.error('ERROR: {}'.format(error['message']))
   
   
with models.DAG(dag_id='run-sql-files',
                schedule_interval='0 */3 * * *',
                user_defined_macros={"COMPOSER": COMPOSER},
                default_args=default_dag_args,
                concurrency=2,
                max_active_runs=1,
                catchup=False) as dag:
   execute_queries_task = PythonOperator(
      task_id="task",
      python_callable=execute_queries
   )
   
   execute_queries_task
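
If you also want the errors collected in a dictionary and printed by a separate final task, as in your original DAG, here is a minimal sketch of one way to extend this (it reuses the imports above; the task_ids are illustrative, and sql_files is assumed to already contain the query strings): the callable returns the error dictionary, which the PythonOperator pushes to XCom, and a second task pulls and prints it.

def execute_queries(**kwargs):
    client = bigquery.Client()
    other_errors = {}

    for sql_file in sql_files:
        query = sql_file  # assumed to already hold the query string
        query_job = client.query(query)
        try:
            query_job.result()
        except Exception:
            for error in query_job.errors:
                # group the failing files by error message
                other_errors.setdefault(error['message'], []).append(sql_file)

    # the return value is automatically pushed to XCom under the key 'return_value'
    return other_errors


def print_errors(**kwargs):
    # pull the error dictionary pushed by the execute_queries task
    other_errors = kwargs['ti'].xcom_pull(task_ids='execute_queries') or {}
    for error, files in other_errors.items():
        print("Warning: " + error + " for the following SQL files:")
        for file in files:
            print(file)


execute_queries_task = PythonOperator(
    task_id='execute_queries',
    python_callable=execute_queries,
    provide_context=True,
    dag=dag)

print_errors_task = PythonOperator(
    task_id='print_errors',
    python_callable=print_errors,
    provide_context=True,
    dag=dag)

execute_queries_task >> print_errors_task

Because the failures are handled inside the callable, the execute_queries task itself stays successful, and only the summary task at the end reports the problems.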
