Airflow - Access Xcom in BranchPythonOperator
I have searched extensively through Airflow blogs and documentation to debug a problem I have.
What I am trying to solve
Check if a particular file exists on an FTP server
If it exists, upload it to the cloud
If it doesn't exist, send an email to the client reporting that no file was found
What I have
A custom operator extending BaseOperator that uses the SSH hook and pushes a value (true or false).
A task that uses BranchPythonOperator to pull the value from XCom, check whether the previous task returned true or false, and decide which task runs next.
Please look at the code below. This code is a simplified version of what I am trying to do.
If anyone is interested in my original code, please scroll down to the end of the question.
Here the custom operator simply pushes the string Even or Odd to XCom, depending on whether the current minute is even or odd.
import logging
from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults
from datetime import datetime

log = logging.getLogger(__name__)

class MediumTestOperator(BaseOperator):

    @apply_defaults
    def __init__(self,
                 do_xcom_push=True,
                 *args,
                 **kwargs):
        super(MediumTestOperator, self).__init__(*args, **kwargs)
        self.do_xcom_push = do_xcom_push
        self.args = args
        self.kwargs = kwargs

    def execute(self, context):
        # from IPython import embed; embed()
        current_minute = datetime.now().minute
        context['ti'].xcom_push(key="Airflow", value="Apache Incubating")
        if current_minute % 2 == 0:
            context['ti'].xcom_push(key="minute", value="Even")
        else:
            context['ti'].xcom_push(key="minute", value="Odd")
        # from IPython import embed; embed()

class MediumTestOperatorPlugin(AirflowPlugin):
    name = "medium_test"
    operators = [MediumTestOperator]
File: caller.py
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
from medium_payen_op import MediumTestOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'guillaume',
    'depends_on_past': False,
    'start_date': datetime(2018, 6, 18),
    'email': ['hello@moonshots.ai'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}

dag = DAG(
    'Weekday',
    default_args=default_args,
    schedule_interval="@once")

sample_task = MediumTestOperator(
    task_id='task_1',
    provide_context=True,
    dag=dag
)

def get_branch_follow(**kwargs):
    x = kwargs['ti'].xcom_pull(task_ids='task_1', key="minute")
    print("From Kwargs: ", x)
    if x == 'Even':
        return 'task_3'
    else:
        return 'task_4'

task_2 = BranchPythonOperator(
    task_id='task_2_branch',
    python_callable=get_branch_follow,
    provide_context=True,
    dag=dag
)

def get_dample(**kwargs):
    x = kwargs['ti'].xcom_pull(task_ids='task_1', key="minute")
    y = kwargs['ti'].xcom_pull(task_ids='task_1', key="Airflow")
    print("Minute is:", x, " Airflow is from: ", y)
    print("Task 3 Running")

task_3 = PythonOperator(
    python_callable=get_dample,
    provide_context=True,
    dag=dag,
    task_id='task_3'
)

def get_dample(**kwargs):
    x = kwargs['ti'].xcom_pull(task_ids='task_1', key="minute")
    y = kwargs['ti'].xcom_pull(task_ids='task_1', key="Airflow")
    print("Minute is:", x, " Airflow is from: ", y)
    print("Task 4 Running")

task_4 = PythonOperator(
    python_callable=get_dample,
    provide_context=True,
    dag=dag,
    task_id='task_4'
)

sample_task >> task_3
task_2 >> task_3
task_2 >> task_4
As you can see from the attached images, the XCom push did work, and I can pull the values from a PythonOperator but not from the BranchPythonOperator.
Any help is appreciated.
XCom pull from inside the Python callable of the BranchPythonOperator always returns None, so the else block always runs.
A Tree View of the DAG
XCom Values from the Admin Screen
XCom pull from the PythonOperator returns the proper values.
This is the original code that I am working with
The custom operator pushes the string True or False as an XCom value, which is then read by the BranchPythonOperator.
I want to read the value pushed by a task created with the above custom operator inside a BranchPythonOperator task and choose a different path based on the returned value.
File: check_file_exists_operator.py
import logging
from tempfile import NamedTemporaryFile
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults

log = logging.getLogger(__name__)

class CheckFileExistsOperator(BaseOperator):
    """
    This operator checks if a given file name exists on the
    sftp server.
    Returns true if it exists, false otherwise.
    :param sftp_path_prefix: The sftp remote path. This is the specified file path
        for downloading the file from the SFTP server.
    :type sftp_path_prefix: string
    :param file_to_be_processed: File that is to be searched
    :type file_to_be_processed: str
    :param sftp_conn_id: The sftp connection id. The name or identifier for
        establishing a connection to the SFTP server.
    :type sftp_conn_id: string
    :param timeout: timeout (in seconds) for executing the command.
    :type timeout: int
    :param do_xcom_push: return the stdout which also get set in xcom by
        airflow platform
    :type do_xcom_push: bool
    """
    FORWARD_SLASH_LITERAL = '/'
    template_fields = ('file_to_be_processed',)

    @apply_defaults
    def __init__(self,
                 sftp_path_prefix,
                 file_to_be_processed,
                 sftp_conn_id='ssh_default',
                 timeout=10,
                 do_xcom_push=True,
                 *args,
                 **kwargs):
        super(CheckFileExistsOperator, self).__init__(*args, **kwargs)
        self.sftp_path_prefix = sftp_path_prefix
        self.file_to_be_processed = file_to_be_processed
        self.sftp_conn_id = sftp_conn_id
        self.timeout = timeout
        self.do_xcom_push = do_xcom_push
        self.args = args
        self.kwargs = kwargs

    def execute(self, context):
        # Refer to https://docs.paramiko.org/en/2.4/api/sftp.html
        ssh_hook = SSHHook(ssh_conn_id=self.sftp_conn_id)
        sftp_client = ssh_hook.get_conn().open_sftp()
        sftp_file_absolute_path = self.sftp_path_prefix.strip() + \
            self.FORWARD_SLASH_LITERAL + \
            self.file_to_be_processed.strip()
        task_instance = context['task_instance']
        log.debug('Checking if the following file exists: %s', sftp_file_absolute_path)
        try:
            with NamedTemporaryFile("w") as temp_file:
                sftp_client.get(sftp_file_absolute_path, temp_file.name)
            # Return a string equivalent of the boolean.
            # Returning a boolean will make the key unreadable
            params = {'file_exists' : True}
            self.kwargs['params'] = params
            task_instance.xcom_push(key="file_exists", value='True')
            log.info('File Exists, returning True')
            return 'True'
        except FileNotFoundError:
            params = {'file_exists' : False}
            self.kwargs['params'] = params
            task_instance.xcom_push(key="file_exists", value='False')
            log.info('File Does not Exist, returning False')
            return 'False'

class CheckFilePlugin(AirflowPlugin):
    name = "check_file_exists"
    operators = [CheckFileExistsOperator]
File: airflow_dag_sample.py
import logging
from airflow import DAG
from check_file_exists_operator import CheckFileExistsOperator
from airflow.contrib.operators.sftp_to_s3_operator import SFTPToS3Operator
from airflow.operators.python_operator import BranchPythonOperator
from datetime import timedelta, datetime
from dateutil.relativedelta import relativedelta
from airflow.operators.email_operator import EmailOperator

log = logging.getLogger(__name__)
FORWARD_SLASH_LITERAL = '/'

default_args = {
    'owner': 'gvatreya',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 1),
    'email': ['***@***.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=2),
    'timeout': 10,
    'sftp_conn_id': 'sftp_local_cluster',
    'provide_context': True
}

dag = DAG('my_test_dag',
          description='Another tutorial DAG',
          schedule_interval='0 12 * * *',
          start_date=datetime(2017, 3, 20),
          default_args=default_args,
          template_searchpath='/Users/your_name/some_path/airflow_home/sql',
          catchup=False)

template_filename_from_xcom = """
{{ task_instance.xcom_pull(task_ids='get_fname_ships', key='file_to_be_processed', dag_id='my_test_dag') }}
"""
template_file_prefix_from_xcom = """
{{ task_instance.xcom_pull(task_ids='get_fname_ships', key="month_prefix_for_file", dag_id='my_test_dag') }}
"""

t_check_file_exists = CheckFileExistsOperator(
    sftp_path_prefix='/toDjembe',
    file_to_be_processed=template_filename_from_xcom.strip(),
    sftp_conn_id='sftp_local_cluster',
    task_id='check_file_exists',
    dag=dag
)

def branch(**kwargs):
    file_exist = kwargs['task_instance'].xcom_pull(task_ids='get_fname_ships', key="file_exists",
                                                   dag_id='my_test_dag')
    print(template_filename_from_xcom)
    from IPython import embed; embed()
    log.debug("FILE_EXIST(from branch): %s", file_exist)
    if file_exist:
        return 's3_upload'
    else:
        return 'send_file_not_found_email'

t_branch_on_file_existence = BranchPythonOperator(
    task_id='branch_on_file_existence',
    python_callable=branch,
    dag=dag
)

t_send_file_not_found_email = EmailOperator(
    task_id='send_file_not_found_email',
    to='***@***.com',
    subject=template_email_subject.format(state='FAILURE',filename=template_filename_from_xcom.strip(),content='Not found on SFTP Server'),
    html_content='File Not Found in SFTP',
    mime_charset='utf-8',
    dag=dag
)

t_upload_to_s3 = SFTPToS3Operator(
    task_id='s3_upload',
    sftp_conn_id='sftp_local_cluster',
    sftp_path='/djembe/' + template_filename_from_xcom.strip(),
    s3_conn_id='s3_conn',
    s3_bucket='djembe-users',
    s3_key='gvatreya/experiment/' + template_file_prefix_from_xcom.strip() + FORWARD_SLASH_LITERAL + template_filename_from_xcom.strip(),
    dag=dag
)

t_check_file_exists >> t_branch_on_file_existence
t_branch_on_file_existence >> t_upload_to_s3
t_branch_on_file_existence >> t_send_file_not_found_email
However, when I run the code, the branch operator always sees the string 'None'.
However, the XCom has the value true.
I tried debugging using IPython embed() and saw that the task instance does not hold the value of the XCom. I tried using params and other things I could think of, but to no avail.
After spending days on this, I am now starting to think I have missed something crucial about XCom in Airflow.
Hoping someone can help.
Thanks in advance.
I think the issue is with the dependencies.
You currently have the following:
sample_task >> task_3
task_2 >> task_3
task_2 >> task_4
Change it to the following, i.e. add the sample_task >> task_2 line:
sample_task >> task_3
sample_task >> task_2
task_2 >> task_3
task_2 >> task_4
The task that pushes to XCom should run before the task that uses BranchPythonOperator.
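To see why the ordering matters, here is a minimal sketch that does not use Airflow at all. FakeTaskInstance is a hypothetical stand-in that only mimics pushing and pulling with an in-memory dict (its xcom_push signature is simplified and is not Airflow's real one). It suggests that the branch callable from the question falls through to task_4 whenever it runs before anything has been pushed for task_1.

```python
class FakeTaskInstance:
    """Hypothetical stand-in for Airflow's TaskInstance (illustration only)."""

    def __init__(self):
        # Maps (task_id, key) -> value, loosely like the XCom table.
        self._store = {}

    def xcom_push(self, task_id, key, value):
        # Simplified signature: the real method does not take task_id.
        self._store[(task_id, key)] = value

    def xcom_pull(self, task_ids, key):
        # Returns None when no matching value has been pushed yet.
        return self._store.get((task_ids, key))


def get_branch_follow(**kwargs):
    # Same decision logic as the branch callable in the question.
    x = kwargs['ti'].xcom_pull(task_ids='task_1', key="minute")
    return 'task_3' if x == 'Even' else 'task_4'


ti = FakeTaskInstance()

# The branch runs before task_1 has pushed anything: the pull yields None.
branch_before_push = get_branch_follow(ti=ti)

# task_1 runs first and pushes, then the branch runs.
ti.xcom_push(task_id='task_1', key="minute", value="Even")
branch_after_push = get_branch_follow(ti=ti)

print(branch_before_push, branch_after_push)
```

Without sample_task upstream of task_2, nothing guarantees that the push has happened by the time the branch callable pulls, which matches the None seen in the question.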
In your second example, the branch function calls xcom_pull with task_ids='get_fname_ships', but I can't find any task with the task_id get_fname_ships.
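One possible corrected branch callable, as a sketch rather than a definitive fix: it pulls from check_file_exists, which is the task_id given to the custom operator in the question's DAG, and it compares against the string 'True'. The explicit comparison is there because the operator pushes the strings 'True' and 'False', and any non-empty string (including 'False') is truthy in Python, so a bare `if file_exist:` would pick s3_upload in both cases. StubTaskInstance is hypothetical and exists only so the function can be exercised outside Airflow.

```python
def branch(**kwargs):
    # Pull from the task that actually pushed the value.
    file_exist = kwargs['task_instance'].xcom_pull(
        task_ids='check_file_exists', key='file_exists')
    # The operator pushes 'True'/'False' as strings, so compare explicitly.
    if file_exist == 'True':
        return 's3_upload'
    return 'send_file_not_found_email'


class StubTaskInstance:
    """Hypothetical stub; returns a fixed value for the expected task and key."""

    def __init__(self, value):
        self._value = value

    def xcom_pull(self, task_ids, key):
        if task_ids == 'check_file_exists' and key == 'file_exists':
            return self._value
        return None


print(branch(task_instance=StubTaskInstance('True')))
print(branch(task_instance=StubTaskInstance('False')))
print(branch(task_instance=StubTaskInstance(None)))
```

In the DAG itself, check_file_exists also has to be upstream of the branch task, which the second example already arranges with t_check_file_exists >> t_branch_on_file_existence.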