Airflow - Access Xcom in BranchPythonOperator

I have searched Airflow blogs and documentation extensively to debug a problem I have.

What I am trying to solve

  1. Check if a particular file exists on an ftp server

  2. If it exists, upload it to the cloud

  3. If it doesn't exist, send an email to the client reporting that no file is found

What I have

  1. A custom operator extending the BaseOperator that uses the SSH Hook and pushes a value (true or false).

  2. A task that uses BranchPythonOperator to pull the value from XCom, check whether the previous task returned true or false, and decide which task runs next.

Please look at the code below. This code is a simplified version of what I am trying to do.

If anyone is interested in my original code, please scroll down to the end of the question.

Here the custom operator simply pushes the string Even or Odd to XCom, depending on whether the current minute is even or odd.

File: medium_payen_op.py

import logging

from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults
from datetime import datetime

log = logging.getLogger(__name__)

class MediumTestOperator(BaseOperator):

    @apply_defaults
    def __init__(self,
                 do_xcom_push=True,
                 *args,
                 **kwargs):
        super(MediumTestOperator, self).__init__(*args, **kwargs)
        self.do_xcom_push = do_xcom_push
        self.args = args
        self.kwargs = kwargs

    def execute(self, context):
        # from IPython import embed; embed()
        current_minute = datetime.now().minute

        context['ti'].xcom_push(key="Airflow", value="Apache Incubating")

        if current_minute %2 == 0:
            context['ti'].xcom_push(key="minute", value="Even")
        else:
            context['ti'].xcom_push(key="minute", value="Odd")
        # from IPython import embed; embed()


class MediumTestOperatorPlugin(AirflowPlugin):
    name = "medium_test"
    operators = [MediumTestOperator]

File: caller.py

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
from medium_payen_op import MediumTestOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'guillaume',
    'depends_on_past': False,
    'start_date': datetime(2018, 6, 18),
    'email': ['hello@moonshots.ai'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}
dag = DAG(
    'Weekday',
    default_args=default_args,
    schedule_interval="@once")


sample_task = MediumTestOperator(
    task_id='task_1',
    provide_context=True,
    dag=dag
)


def get_branch_follow(**kwargs):
    x = kwargs['ti'].xcom_pull(task_ids='task_1', key="minute")
    print("From Kwargs: ", x)
    if x == 'Even':
        return 'task_3'
    else:
        return 'task_4'


task_2 = BranchPythonOperator(
    task_id='task_2_branch',
    python_callable=get_branch_follow,
    provide_context=True,
    dag=dag
)


def get_dample(**kwargs):
    x = kwargs['ti'].xcom_pull(task_ids='task_1', key="minute")
    y = kwargs['ti'].xcom_pull(task_ids='task_1', key="Airflow")
    print("Minute is:", x, " Airflow is from: ", y)
    print("Task 3 Running")


task_3 = PythonOperator(
    python_callable=get_dample,
    provide_context=True,
    dag=dag,
    task_id='task_3'
)


def get_dample(**kwargs):
    x = kwargs['ti'].xcom_pull(task_ids='task_1', key="minute")
    y = kwargs['ti'].xcom_pull(task_ids='task_1', key="Airflow")
    print("Minute is:", x, " Airflow is from: ", y)
    print("Task 4 Running")


task_4 = PythonOperator(
    python_callable=get_dample,
    provide_context=True,
    dag=dag,
    task_id='task_4'
)

sample_task >> task_3

task_2 >> task_3
task_2 >> task_4

As you can see from the attached images, the Xcom push did work and I can pull the values from PythonOperator but not from the BranchPythonOperator.

Any help is appreciated.

Xcom Pull from inside the Python Callable of the BranchPythonOperator always returns 'None', so the else block always runs. (Image: PythonBranchOperator log, Xcom_Pull returns 'None')

A Tree View of the DAG. (Image: tree view of the DAG)

XCom Values from the Admin Screen. (Image: XCom values on the admin screen)

Xcom Pull from the PythonOperator returns proper values. (Image: Xcom Pull from the PythonOperator, working correctly)

(Image: Xcom Pull, different values)


This is the original code that I am working with

The custom operator pushes the string True or False as an XCom value, which is then read by the BranchPythonOperator.

I want to read the value pushed by a task created using the above custom operator inside of a BranchPythonOperator task and choose a different path based on the returned value.

File: check_file_exists_operator.py

import logging
from tempfile import NamedTemporaryFile

from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults

log = logging.getLogger(__name__)


class CheckFileExistsOperator(BaseOperator):
    """
    This operator checks if a given file name exists on
    the sftp server.

    Returns true if it exists, false otherwise.

    :param sftp_path_prefix: The sftp remote path. This is the specified file path
        for downloading the file from the SFTP server.
    :type sftp_path_prefix: string
    :param file_to_be_processed: File that is to be Searched 
    :type file_to_be_processed: str
    :param sftp_conn_id: The sftp connection id. The name or identifier for
        establishing a connection to the SFTP server.
    :type sftp_conn_id: string
    :param timeout: timeout (in seconds) for executing the command.
    :type timeout: int
    :param do_xcom_push: return the stdout which also get set in xcom by
           airflow platform
    :type do_xcom_push: bool

    """

    FORWARD_SLASH_LITERAL = '/'

    template_fields = ('file_to_be_processed',)

    @apply_defaults
    def __init__(self,
                 sftp_path_prefix,
                 file_to_be_processed,
                 sftp_conn_id='ssh_default',
                 timeout=10,
                 do_xcom_push=True,
                 *args,
                 **kwargs):
        super(CheckFileExistsOperator, self).__init__(*args, **kwargs)
        self.sftp_path_prefix = sftp_path_prefix
        self.file_to_be_processed = file_to_be_processed
        self.sftp_conn_id = sftp_conn_id
        self.timeout = timeout
        self.do_xcom_push = do_xcom_push
        self.args = args
        self.kwargs = kwargs

    def execute(self, context):

        # Refer to https://docs.paramiko.org/en/2.4/api/sftp.html
        ssh_hook = SSHHook(ssh_conn_id=self.sftp_conn_id)
        sftp_client = ssh_hook.get_conn().open_sftp()

        sftp_file_absolute_path = self.sftp_path_prefix.strip() + \
                                  self.FORWARD_SLASH_LITERAL + \
                                  self.file_to_be_processed.strip()

        task_instance = context['task_instance']

        log.debug('Checking if the following file exists: %s', sftp_file_absolute_path)

        try:
            with NamedTemporaryFile("w") as temp_file:
                sftp_client.get(sftp_file_absolute_path, temp_file.name)

                # Return a string equivalent of the boolean.
                # Returning a boolean will make the key unreadable
                params = {'file_exists' : True}
                self.kwargs['params'] = params
                task_instance.xcom_push(key="file_exists", value='True')

                log.info('File Exists, returning True')

                return 'True'

        except FileNotFoundError:
            params = {'file_exists' : False}
            self.kwargs['params'] = params
            task_instance.xcom_push(key="file_exists", value='False')

            log.info('File Does not Exist, returning False')

            return 'False'


class CheckFilePlugin(AirflowPlugin):
    name = "check_file_exists"
    operators = [CheckFileExistsOperator]

File: airflow_dag_sample.py

import logging

from airflow import DAG
from check_file_exists_operator import CheckFileExistsOperator
from airflow.contrib.operators.sftp_to_s3_operator import SFTPToS3Operator
from airflow.operators.python_operator import BranchPythonOperator
from datetime import timedelta, datetime
from dateutil.relativedelta import relativedelta
from airflow.operators.email_operator import EmailOperator

log = logging.getLogger(__name__)
FORWARD_SLASH_LITERAL = '/'

default_args = {
    'owner': 'gvatreya',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 1),
    'email': ['***@***.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=2),
    'timeout': 10,
    'sftp_conn_id': 'sftp_local_cluster',
    'provide_context': True
}

dag = DAG('my_test_dag',
          description='Another tutorial DAG',
          schedule_interval='0 12 * * *',
          start_date=datetime(2017, 3, 20),
          default_args=default_args,
          template_searchpath='/Users/your_name/some_path/airflow_home/sql',
          catchup=False)

template_filename_from_xcom = """
    {{ task_instance.xcom_pull(task_ids='get_fname_ships', key='file_to_be_processed', dag_id='my_test_dag') }}
"""

template_file_prefix_from_xcom = """
    {{ task_instance.xcom_pull(task_ids='get_fname_ships', key="month_prefix_for_file", dag_id='my_test_dag') }}
"""

t_check_file_exists = CheckFileExistsOperator(
    sftp_path_prefix='/toDjembe',
    file_to_be_processed=template_filename_from_xcom.strip(),
    sftp_conn_id='sftp_local_cluster',
    task_id='check_file_exists',
    dag=dag
)


def branch(**kwargs):
    file_exist = kwargs['task_instance'].xcom_pull(task_ids='get_fname_ships', key="file_exists",
                                                   dag_id='my_test_dag')
    print(template_filename_from_xcom)
    from IPython import embed; embed()
    log.debug("FILE_EXIST(from branch): %s", file_exist)
    if file_exist:
        return 's3_upload'
    else:
        return 'send_file_not_found_email'


t_branch_on_file_existence = BranchPythonOperator(
    task_id='branch_on_file_existence',
    python_callable=branch,
    dag=dag
)

t_send_file_not_found_email = EmailOperator(
    task_id='send_file_not_found_email',
    to='***@***.com',
    subject=template_email_subject.format(state='FAILURE',filename=template_filename_from_xcom.strip(),content='Not found on SFTP Server'),
    html_content='File Not Found in SFTP',
    mime_charset='utf-8',
    dag=dag
)

t_upload_to_s3 = SFTPToS3Operator(
    task_id='s3_upload',
    sftp_conn_id='sftp_local_cluster',
    sftp_path='/djembe/' + template_filename_from_xcom.strip(),
    s3_conn_id='s3_conn',
    s3_bucket='djembe-users',
    s3_key='gvatreya/experiment/' + template_file_prefix_from_xcom.strip() + FORWARD_SLASH_LITERAL + template_filename_from_xcom.strip(),
    dag=dag
)

t_check_file_exists >> t_branch_on_file_existence

t_branch_on_file_existence >> t_upload_to_s3
t_branch_on_file_existence >> t_send_file_not_found_email

When I run the code, however, the branch operator always sees the string 'None', even though the XCom has the value True.

I tried debugging with IPython embed() and saw that the task instance does not hold the value of the XCom. I tried using params and other things I could think of, but to no avail.

After spending days on this, I am now starting to think I have missed something crucial about XCom in Airflow.

Hoping someone can help.

Thanks in advance.

I think the issue is with the task dependencies.

You currently have the following:

sample_task >> task_3

task_2 >> task_3
task_2 >> task_4

Change it to the following, i.e. add the line sample_task >> task_2:

sample_task >> task_3
sample_task >> task_2

task_2 >> task_3
task_2 >> task_4

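The task that pushes to XCom must run before the task that uses BranchPythonOperator. Without that dependency, task_2 has no upstream link to sample_task, so it can run before the value has been pushed, and xcom_pull then returns None.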
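
To see why the ordering matters, here is a minimal sketch that runs the question's get_branch_follow logic without Airflow. FakeTaskInstance is a hypothetical stand-in that only mimics xcom_pull with a plain dict; it is not Airflow's TaskInstance, and real XCom values live in the metadata database.

```python
class FakeTaskInstance:
    """Hypothetical stand-in for Airflow's TaskInstance (illustration only)."""

    def __init__(self, store):
        # store maps (task_id, key) -> value, standing in for the XCom table
        self.store = store

    def xcom_pull(self, task_ids=None, key=None):
        # Like the real call, a missing entry yields None rather than an error
        return self.store.get((task_ids, key))


def get_branch_follow(**kwargs):
    # Same decision logic as the branch callable in the question
    x = kwargs['ti'].xcom_pull(task_ids='task_1', key='minute')
    return 'task_3' if x == 'Even' else 'task_4'


# Case 1: the branch task runs before task_1 has pushed anything
branch_before_push = get_branch_follow(ti=FakeTaskInstance({}))

# Case 2: task_1 has already pushed its value
branch_after_push = get_branch_follow(
    ti=FakeTaskInstance({('task_1', 'minute'): 'Even'}))

print(branch_before_push, branch_after_push)
```

In the first case the pull yields None, so the function falls through to the else branch, which matches the behaviour described in the question.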

In your second example, the branch function calls xcom_pull(task_ids='get_fname_ships', ...), but I can't find any task with the task_id get_fname_ships. The task that pushes the file_exists key is check_file_exists.
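
A possible shape for the second example's branch callable is sketched below. It assumes check_file_exists is the task that pushes the file_exists key, as in the DAG in the question. It also compares against the string 'True', because the operator pushes strings and any non-empty string (including 'False') is truthy in Python. StubTaskInstance is a hypothetical stand-in used only so the snippet runs without Airflow.

```python
class StubTaskInstance:
    """Hypothetical stand-in for Airflow's TaskInstance (illustration only)."""

    def __init__(self, xcoms):
        self.xcoms = xcoms  # maps (task_id, key) -> value

    def xcom_pull(self, task_ids=None, key=None, dag_id=None):
        return self.xcoms.get((task_ids, key))


def branch(**kwargs):
    # Pull from the task that actually pushes the key
    file_exist = kwargs['task_instance'].xcom_pull(
        task_ids='check_file_exists', key='file_exists')
    # The operator pushes the strings 'True' / 'False', so compare explicitly
    if file_exist == 'True':
        return 's3_upload'
    return 'send_file_not_found_email'


found = branch(task_instance=StubTaskInstance(
    {('check_file_exists', 'file_exists'): 'True'}))
missing = branch(task_instance=StubTaskInstance(
    {('check_file_exists', 'file_exists'): 'False'}))

print(found, missing)
```

With a plain `if file_exist:` check, the 'False' case would also take the s3_upload path, which is why the comparison is explicit here.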
