
Airflow - Access Xcom in BranchPythonOperator

I have searched the Airflow blogs and documentation extensively to debug a problem I have.

What I am trying to solve

  1. Check if a particular file exists on an ftp server

  2. If it exists, upload it to the cloud

  3. If it doesn't exist, send an email to the client reporting that no file was found

What I have

  1. A custom operator extending the BaseOperator that uses the SSH Hook and pushes a value (true or false).

  2. A task that uses BranchPythonOperator to pull the value from xcom, check whether the previous task returned true or false, and decide which task runs next.

Please look at the code below. This code is a simplified version of what I am trying to do.

If anyone is interested in my original code, please scroll down to the end of the question.

Here the custom operator simply pushes the string Even or Odd to XCom, depending on whether the current minute is even or odd.

import logging

from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults
from datetime import datetime

log = logging.getLogger(__name__)

class MediumTestOperator(BaseOperator):

    @apply_defaults
    def __init__(self,
                 do_xcom_push=True,
                 *args,
                 **kwargs):
        super(MediumTestOperator, self).__init__(*args, **kwargs)
        self.do_xcom_push = do_xcom_push
        self.args = args
        self.kwargs = kwargs

    def execute(self, context):
        # from IPython import embed; embed()
        current_minute = datetime.now().minute

        context['ti'].xcom_push(key="Airflow", value="Apache Incubating")

        if current_minute %2 == 0:
            context['ti'].xcom_push(key="minute", value="Even")
        else:
            context['ti'].xcom_push(key="minute", value="Odd")
        # from IPython import embed; embed()


class MediumTestOperatorPlugin(AirflowPlugin):
    name = "medium_test"
    operators = [MediumTestOperator]

File: caller.py

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
from medium_payen_op import MediumTestOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'guillaume',
    'depends_on_past': False,
    'start_date': datetime(2018, 6, 18),
    'email': ['hello@moonshots.ai'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}
dag = DAG(
    'Weekday',
    default_args=default_args,
    schedule_interval="@once")


sample_task = MediumTestOperator(
    task_id='task_1',
    provide_context=True,
    dag=dag
)


def get_branch_follow(**kwargs):
    x = kwargs['ti'].xcom_pull(task_ids='task_1', key="minute")
    print("From Kwargs: ", x)
    if x == 'Even':
        return 'task_3'
    else:
        return 'task_4'


task_2 = BranchPythonOperator(
    task_id='task_2_branch',
    python_callable=get_branch_follow,
    provide_context=True,
    dag=dag
)


def get_dample(**kwargs):
    x = kwargs['ti'].xcom_pull(task_ids='task_1', key="minute")
    y = kwargs['ti'].xcom_pull(task_ids='task_1', key="Airflow")
    print("Minute is:", x, " Airflow is from: ", y)
    print("Task 3 Running")


task_3 = PythonOperator(
    python_callable=get_dample,
    provide_context=True,
    dag=dag,
    task_id='task_3'
)


def get_dample(**kwargs):
    x = kwargs['ti'].xcom_pull(task_ids='task_1', key="minute")
    y = kwargs['ti'].xcom_pull(task_ids='task_1', key="Airflow")
    print("Minute is:", x, " Airflow is from: ", y)
    print("Task 4 Running")


task_4 = PythonOperator(
    python_callable=get_dample,
    provide_context=True,
    dag=dag,
    task_id='task_4'
)

sample_task >> task_3

task_2 >> task_3
task_2 >> task_4

As you can see from the attached images, the Xcom push did work, and I can pull the values from the PythonOperator but not from the BranchPythonOperator.

Any help is appreciated.

Xcom Pull from inside the Python callable of the BranchPythonOperator always returns 'None', so the else block always runs.

A Tree View of the DAG

XCom Values from the Admin Screen

Xcom Pull from the PythonOperator returns proper values.

Xcom Pull - different values


This is the original code that I am working with

The custom operator pushes a string True or False as an Xcom value, which is then read by the BranchPythonOperator.

I want to read the value pushed by a task created using the above custom operator inside of a BranchPythonOperator task and choose a different path based on the returned value.

File: check_file_exists_operator.py

import logging
from tempfile import NamedTemporaryFile

from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults

log = logging.getLogger(__name__)


class CheckFileExistsOperator(BaseOperator):
    """
    This operator checks if a given file name exists on
    the sftp server.

    Returns true if it exists, false otherwise.

    :param sftp_path_prefix: The sftp remote path. This is the specified file path
        for downloading the file from the SFTP server.
    :type sftp_path_prefix: string
    :param file_to_be_processed: File to search for
    :type file_to_be_processed: str
    :param sftp_conn_id: The sftp connection id. The name or identifier for
        establishing a connection to the SFTP server.
    :type sftp_conn_id: string
    :param timeout: timeout (in seconds) for executing the command.
    :type timeout: int
    :param do_xcom_push: return the stdout which also get set in xcom by
           airflow platform
    :type do_xcom_push: bool

    """

    FORWARD_SLASH_LITERAL = '/'

    template_fields = ('file_to_be_processed',)

    @apply_defaults
    def __init__(self,
                 sftp_path_prefix,
                 file_to_be_processed,
                 sftp_conn_id='ssh_default',
                 timeout=10,
                 do_xcom_push=True,
                 *args,
                 **kwargs):
        super(CheckFileExistsOperator, self).__init__(*args, **kwargs)
        self.sftp_path_prefix = sftp_path_prefix
        self.file_to_be_processed = file_to_be_processed
        self.sftp_conn_id = sftp_conn_id
        self.timeout = timeout
        self.do_xcom_push = do_xcom_push
        self.args = args
        self.kwargs = kwargs

    def execute(self, context):

        # Refer to https://docs.paramiko.org/en/2.4/api/sftp.html
        ssh_hook = SSHHook(ssh_conn_id=self.sftp_conn_id)
        sftp_client = ssh_hook.get_conn().open_sftp()

        sftp_file_absolute_path = self.sftp_path_prefix.strip() + \
                                  self.FORWARD_SLASH_LITERAL + \
                                  self.file_to_be_processed.strip()

        task_instance = context['task_instance']

        log.debug('Checking if the following file exists: %s', sftp_file_absolute_path)

        try:
            with NamedTemporaryFile("w") as temp_file:
                sftp_client.get(sftp_file_absolute_path, temp_file.name)

                # Return a string equivalent of the boolean.
                # Returning a boolean will make the key unreadable
                params = {'file_exists' : True}
                self.kwargs['params'] = params
                task_instance.xcom_push(key="file_exists", value='True')

                log.info('File Exists, returning True')

                return 'True'

        except FileNotFoundError:
            params = {'file_exists' : False}
            self.kwargs['params'] = params
            task_instance.xcom_push(key="file_exists", value='False')

            log.info('File Does not Exist, returning False')

            return 'False'


class CheckFilePlugin(AirflowPlugin):
    name = "check_file_exists"
    operators = [CheckFileExistsOperator]

File: airflow_dag_sample.py

import logging

from airflow import DAG
from check_file_exists_operator import CheckFileExistsOperator
from airflow.contrib.operators.sftp_to_s3_operator import SFTPToS3Operator
from airflow.operators.python_operator import BranchPythonOperator
from datetime import timedelta, datetime
from dateutil.relativedelta import relativedelta
from airflow.operators.email_operator import EmailOperator

log = logging.getLogger(__name__)
FORWARD_SLASH_LITERAL = '/'

default_args = {
    'owner': 'gvatreya',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 1),
    'email': ['***@***.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=2),
    'timeout': 10,
    'sftp_conn_id': 'sftp_local_cluster',
    'provide_context': True
}

dag = DAG('my_test_dag',
          description='Another tutorial DAG',
          schedule_interval='0 12 * * *',
          start_date=datetime(2017, 3, 20),
          default_args=default_args,
          template_searchpath='/Users/your_name/some_path/airflow_home/sql',
          catchup=False)

template_filename_from_xcom = """
    {{ task_instance.xcom_pull(task_ids='get_fname_ships', key='file_to_be_processed', dag_id='my_test_dag') }}
"""

template_file_prefix_from_xcom = """
    {{ task_instance.xcom_pull(task_ids='get_fname_ships', key="month_prefix_for_file", dag_id='my_test_dag') }}
"""

t_check_file_exists = CheckFileExistsOperator(
    sftp_path_prefix='/toDjembe',
    file_to_be_processed=template_filename_from_xcom.strip(),
    sftp_conn_id='sftp_local_cluster',
    task_id='check_file_exists',
    dag=dag
)


def branch(**kwargs):
    file_exist = kwargs['task_instance'].xcom_pull(task_ids='get_fname_ships', key="file_exists",
                                                   dag_id='my_test_dag')
    print(template_filename_from_xcom)
    from IPython import embed; embed()
    log.debug("FILE_EXIST(from branch): %s", file_exist)
    if file_exist:
        return 's3_upload'
    else:
        return 'send_file_not_found_email'


t_branch_on_file_existence = BranchPythonOperator(
    task_id='branch_on_file_existence',
    python_callable=branch,
    dag=dag
)

t_send_file_not_found_email = EmailOperator(
    task_id='send_file_not_found_email',
    to='***@***.com',
    subject=template_email_subject.format(state='FAILURE',filename=template_filename_from_xcom.strip(),content='Not found on SFTP Server'),
    html_content='File Not Found in SFTP',
    mime_charset='utf-8',
    dag=dag
)

t_upload_to_s3 = SFTPToS3Operator(
    task_id='s3_upload',
    sftp_conn_id='sftp_local_cluster',
    sftp_path='/djembe/' + template_filename_from_xcom.strip(),
    s3_conn_id='s3_conn',
    s3_bucket='djembe-users',
    s3_key='gvatreya/experiment/' + template_file_prefix_from_xcom.strip() + FORWARD_SLASH_LITERAL + template_filename_from_xcom.strip(),
    dag=dag
)

t_check_file_exists >> t_branch_on_file_existence

t_branch_on_file_existence >> t_upload_to_s3
t_branch_on_file_existence >> t_send_file_not_found_email

However, when I run the code, the branch operator always sees the string 'None'.

However, the Xcom has the value true.

I tried debugging using IPython embed() and saw that the task instance does not hold the value of the xcom. I tried using params and other things that I could think of, but to no avail.

After spending days on this, I am now starting to think I have missed something crucial about XCom in Airflow.

I hope someone can help.

Thanks in advance.

I think the issue is with the dependencies.

You currently have the following:

sample_task >> task_3

task_2 >> task_3
task_2 >> task_4

Change it to the following, i.e. add the sample_task >> task_2 line.

sample_task >> task_3
sample_task >> task_2

task_2 >> task_3
task_2 >> task_4

The task that pushes to xcom must run before the task that uses BranchPythonOperator. Without that dependency, the branch task can run before anything has been pushed, so xcom_pull returns None.
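To see why the ordering matters, here is a minimal sketch that runs without Airflow. `FakeXCom` and its methods are a hypothetical in-memory stand-in made up for illustration, not an Airflow API; the only behaviour it borrows is that a pull returns None when nothing has been pushed yet. The branch logic mirrors get_branch_follow from the question.

```python
class FakeXCom:
    """Hypothetical in-memory stand-in for XCom storage (not an Airflow class)."""

    def __init__(self):
        self._store = {}

    def xcom_push(self, task_id, key, value):
        self._store[(task_id, key)] = value

    def xcom_pull(self, task_ids, key):
        # Return None when nothing has been pushed for this task/key yet.
        return self._store.get((task_ids, key))


def get_branch_follow(xcom):
    # Same decision as the branch callable in the question.
    x = xcom.xcom_pull(task_ids='task_1', key='minute')
    return 'task_3' if x == 'Even' else 'task_4'


# Without sample_task >> task_2, the branch can run before task_1 has pushed.
xcom = FakeXCom()
branch_before_push = get_branch_follow(xcom)

# With sample_task >> task_2, task_1 has already pushed when the branch runs.
xcom.xcom_push(task_id='task_1', key='minute', value='Even')
branch_after_push = get_branch_follow(xcom)

print(branch_before_push, branch_after_push)
```

In the first call the pull finds nothing, so the comparison with 'Even' fails and the else path is taken, which matches the behaviour described in the question.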

In your 2nd example, the branch function calls xcom_pull with task_ids='get_fname_ships', but I can't find any task with the task_id get_fname_ships.
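Assuming the value is pushed by the check_file_exists task under the key file_exists, as in the operator from the question, the branch callable could look like the sketch below. It also compares against the string 'True', because the operator pushes strings and a non-empty string such as 'False' is truthy in Python. `StubTaskInstance` is a hypothetical stand-in so the snippet runs without Airflow; in the DAG, Airflow passes the real task instance.

```python
class StubTaskInstance:
    """Hypothetical stand-in for the task instance Airflow passes in."""

    def __init__(self, pushed):
        self._pushed = pushed  # maps (task_id, key) -> value

    def xcom_pull(self, task_ids=None, key=None, dag_id=None):
        return self._pushed.get((task_ids, key))


def branch(**kwargs):
    # Pull from the task that actually pushed the value.
    file_exist = kwargs['task_instance'].xcom_pull(
        task_ids='check_file_exists', key='file_exists')
    # The operator pushes the strings 'True' / 'False', so compare explicitly.
    if file_exist == 'True':
        return 's3_upload'
    return 'send_file_not_found_email'


found = StubTaskInstance({('check_file_exists', 'file_exists'): 'True'})
missing = StubTaskInstance({('check_file_exists', 'file_exists'): 'False'})
print(branch(task_instance=found))
print(branch(task_instance=missing))
```

With a plain `if file_exist:` check, both cases would take the upload path, since 'False' is a non-empty string.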
