
How to run tasks individually in Airflow?

I have a list of tables I want to run my script through. It works successfully when I do one table at a time, but when I try a for loop above the tasks, it runs all the tables at once, giving me multiple errors.

Here is my code:

# imports required by the snippet (Airflow 1.x-style module paths)
import csv
import gzip
from datetime import datetime, timedelta
from io import StringIO

import paramiko
import psycopg2
from sshtunnel import SSHTunnelForwarder

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator


def create_tunnel_postgres():

    psql_host = ''
    psql_port = 5432
    ssh_host= ''
    ssh_port = 22
    ssh_username = ''
    pkf = paramiko.RSAKey.from_private_key(StringIO(Variable.get('my_key')))

    server = SSHTunnelForwarder(
        (ssh_host, 22),
        ssh_username=ssh_username,
        ssh_private_key=pkf,
        remote_bind_address=(psql_host, 5432))

    return server

def conn_postgres_internal(server):
    """
    Using the server connect to the internal postgres
    """
    conn = psycopg2.connect(
        database='pricing',
        user= Variable.get('postgres_db_user'),
        password= Variable.get('postgres_db_key'),
        host=server.local_bind_host,
        port=server.local_bind_port,
    )

    return conn

def gzip_postgres_table(**kwargs):
    """Dump the first 100 rows of a table into a gzipped CSV file."""
    table_name = kwargs['table_name']
    path = '/path/{}.csv'.format(table_name)
    server_postgres = create_tunnel_postgres()
    server_postgres.start()
    etl_conn = conn_postgres_internal(server_postgres)
    cur=etl_conn.cursor()
    cur.execute("""
        select * from schema.db.{} limit 100;
        """.format(table_name))
    result = cur.fetchall()
    column_names = [i[0] for i in cur.description]
    fp = gzip.open(path, 'wt')
    myFile = csv.writer(fp,delimiter=',')
    myFile.writerow(column_names)
    myFile.writerows(result)
    fp.close()
    etl_conn.close()
    server_postgres.stop()


#------------------------------------------------------------------------------------------------------------------------------------------------

default_args = {
    'owner': 'mae',
    'depends_on_past':False,
    'start_date': datetime(2020,1,1),
    'email': ['maom@aol.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=1)
}


tables= ['table1','table2']
s3_folder='de'
current_timestamp=datetime.now()



#Element'S VARIABLES

dag = DAG('dag1',
          description = 'O',
          default_args=default_args,
          max_active_runs=1,
          schedule_interval= '@once',
          #schedule_interval='hourly'
          catchup = False )


for table_name in tables:
    t1 = PythonOperator(
        task_id='{}_gzip_table'.format(table_name),
        python_callable= gzip_postgres_table,
        provide_context=True,
        op_kwargs={'table_name':table_name,'s3_folder':s3_folder,'current_timestamp':current_timestamp},
        dag = dag)

Is there a way to run table1 first, let it finish, and then run table2? I tried doing that with for table_name in tables: but to no avail. Any ideas or suggestions would help.

Your for loop is creating multiple tasks for your table processing; by default, Airflow will execute those tasks in parallel.

You can either set the number of workers in the Airflow config file to 1, or create only one task and run your loop inside that task, which will then be executed synchronously.
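A minimal sketch of the single-task option, reusing the names from the question (gzip_postgres_table, tables, s3_folder, current_timestamp, dag); the gzip_all_tables wrapper and its task_id are illustrative names introduced here:

def gzip_all_tables(**kwargs):
    # illustrative wrapper: process every table inside one task, one after another
    for table_name in tables:
        gzip_postgres_table(table_name=table_name, **kwargs)

gzip_all = PythonOperator(
    task_id='gzip_all_tables',
    python_callable=gzip_all_tables,
    provide_context=True,
    op_kwargs={'s3_folder': s3_folder, 'current_timestamp': current_timestamp},
    dag=dag)

Because there is only one task, the tables are processed sequentially regardless of the executor or parallelism settings.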

I saw your code, and it seems like you're creating multiple DAG tasks using a looping statement, which runs the tasks in parallel.

There are certain ways to achieve your requirement.

  1. Use the sequential_executor.

airflow.executors.sequential_executor.SequentialExecutor will only run task instances sequentially.

https://airflow.apache.org/docs/stable/start.html#quick-start

  2. Create a script that works according to your need.

Create a Python script and use it as a single PythonOperator that repeats your current function for each of the tables.

  3. Limit Airflow executor parallelism to 1.

You can limit your Airflow workers to 1 in the airflow.cfg config file.

Steps:

Open airflow.cfg from your Airflow root (AIRFLOW_HOME).

Set/update parallelism = 1.

Restart Airflow.

This should work. A sketch of the relevant airflow.cfg lines for options 1 and 3 follows below.
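A minimal sketch, assuming the default Airflow 1.x airflow.cfg layout (both keys live in the [core] section; check the section names against your own file):

[core]
# option 1: the SequentialExecutor runs task instances one at a time
executor = SequentialExecutor
# option 3: cap how many task instances may run at once across the installation
parallelism = 1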

I see 3 ways of solving this.

I think you need a DAG like this (image of the DAG graph):

Code for it:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

import sys
sys.path.append('../')
from mssql_loader import core  # program code, which starts the load
from mssql_loader import locals  # local variables, contains dictionaries with names
def contact_load(typ,db):

    core.starter(typ=typ,db=db)
    return 'MSSQL LOADED '+db['DBpseudo']+'.'+typ

dag = DAG('contact_loader', description='MSSQL sqlcontact.uka.local loader to GBQ',
          schedule_interval='0 7 * * *',
          start_date=datetime(2017, 3, 20), catchup=False)

start_operator = DummyOperator(task_id='ROBO_task', retries=3, dag=dag)


for v in locals.TABLES:
    for db in locals.DB:        
        task = PythonOperator(
            task_id=db['DBpseudo'] + '_mssql_' + v,  # creates Express_mssql_fast, UKA_mssql_important, etc.
            python_callable=contact_load,
            op_kwargs={'typ': v,'db':db},
            retries=3,
            dag=dag,
        )

        start_operator >> task  # create a parent-child dependency from the first task to each of the others
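A possible variation of the same loop, if the table tasks themselves must run strictly one after another as asked in the question, is to chain each new task to the previously created one instead of attaching them all to start_operator; previous_task is just an illustrative variable name:

previous_task = start_operator
for v in locals.TABLES:
    for db in locals.DB:
        task = PythonOperator(
            task_id=db['DBpseudo'] + '_mssql_' + v,
            python_callable=contact_load,
            op_kwargs={'typ': v, 'db': db},
            retries=3,
            dag=dag,
        )
        previous_task >> task  # each task waits for the one created before it
        previous_task = task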

dag = DAG(dag_id='you_DAG',
          default_args=default_args,
          schedule_interval='10 6 * * *',
          max_active_runs=1)  # only 1 is executed at a time here
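As a hedged side note, in Airflow 1.x the DAG constructor also accepts a concurrency argument that caps how many task instances of the DAG may run at the same time, while max_active_runs caps concurrent DAG runs. A sketch combining both, assuming the default_args from the question:

dag = DAG(dag_id='you_DAG',
          default_args=default_args,
          schedule_interval='10 6 * * *',
          max_active_runs=1,  # at most one active DAG run
          concurrency=1)      # at most one running task instance in this DAG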
