
Composer does not see that the Dataflow job succeeded

I am using Google Cloud Composer to launch Dataflow jobs.

My DAG consists of two Dataflow jobs that should run one after the other.

import datetime

from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator
from airflow import models


default_dag_args = {
    'start_date': datetime.datetime(2019, 10, 23),
    'dataflow_default_options': {
               'project': 'myproject',
               'region': 'europe-west1',
               'zone': 'europe-west1-c',
               'tempLocation': 'gs://somebucket/',
               }
}

with models.DAG(
        'some_name',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    parameters = {'params': "param1"}

    t1 = DataflowTemplateOperator(
        task_id='dataflow_example_01',
        template='gs://path/to/template/template_001',
        parameters=parameters,
        dag=dag)

    parameters2 = {'params':"param2"}

    t2 = DataflowTemplateOperator(
        task_id='dataflow_example_02',
        template='gs://path/to/templates/template_002',
        parameters=parameters2,
        dag=dag
    )

    t1 >> t2

When I check in Dataflow, the job has succeeded and all the files it is supposed to create are there, but it appears to have run in the US region, while the Cloud Composer environment is in europe-west1.

In Airflow I can see that the first task is still marked as running, so the second one is never launched.


What should I add to the DAG to make it succeed? How do I make the jobs run in Europe?

Any advice or solution on how to proceed would be most appreciated. Thanks!

I had to solve this issue in the past. In Airflow 1.10.2 (or lower) the code calls the service.projects().templates().launch() endpoint. This was fixed in 1.10.3, where the regional one, service.projects().locations().templates().launch(), is used instead.
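For illustration, this is roughly what the 1.10.2 hook does when launching a template (a minimal, simplified sketch using the values from the question, not the actual Airflow code). Because the request goes to the non-regional endpoint, the configured region is never sent and the job lands in the default region, us-central1:

from googleapiclient.discovery import build

# Simplified sketch of the pre-1.10.3 launch call; project, template path and
# parameters are taken from the question's DAG and are only illustrative.
# The non-regional endpoint ignores the configured region.
service = build('dataflow', 'v1b3', cache_discovery=False)
request = service.projects().templates().launch(
    projectId='myproject',
    gcsPath='gs://path/to/template/template_001',
    body={'jobName': 'dataflow-example-01', 'parameters': {'params': 'param1'}},
)
response = request.execute()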

As of October 2019, the latest Airflow version available for Composer environments is 1.10.2. If you need a solution immediately, the fix can be back-ported into Composer.

For this, we can subclass DataflowTemplateOperator into our own version, RegionalDataflowTemplateOperator:

class RegionalDataflowTemplateOperator(DataflowTemplateOperator):
  def execute(self, context):
    # Same flow as the stock operator, but using the regional hook defined below.
    hook = RegionalDataFlowHook(gcp_conn_id=self.gcp_conn_id,
                                delegate_to=self.delegate_to,
                                poll_sleep=self.poll_sleep)

    hook.start_template_dataflow(self.task_id, self.dataflow_default_options,
                                 self.parameters, self.template)

This makes use of the modified RegionalDataFlowHook, which overrides the _start_template_dataflow method of DataFlowHook to call the correct regional endpoint:

class RegionalDataFlowHook(DataFlowHook):
  def _start_template_dataflow(self, name, variables, parameters,
                               dataflow_template):
      ...
      # Identical to the parent implementation, except that the request targets
      # the regional endpoint so the configured region is honoured.
      request = service.projects().locations().templates().launch(
          projectId=variables['project'],
          location=variables['region'],
          gcsPath=dataflow_template,
          body=body
      )
      ...
      return response

Then, we can create a task using our new operator and a Google-provided template (for testing purposes):

task = RegionalDataflowTemplateOperator(
    task_id=JOB_NAME,
    template=TEMPLATE_PATH,
    parameters={
        'inputFile': 'gs://dataflow-samples/shakespeare/kinglear.txt',
        'output': 'gs://{}/europe/output'.format(BUCKET)
    },
    dag=dag,
)
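The JOB_NAME, TEMPLATE_PATH and BUCKET names above are placeholders. Purely as an assumption for a quick test, they could point at the Google-provided word count template, which accepts the inputFile and output parameters used here:

# Hypothetical test values, not taken from the original setup; adjust to your
# own project and bucket.
JOB_NAME = 'regional-dataflow-test'
TEMPLATE_PATH = 'gs://dataflow-templates/latest/Word_Count'
BUCKET = 'somebucket'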

Full working DAG here. For a cleaner version, the operator can be moved into a separate module, as sketched below.
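A minimal sketch of that split, assuming the helper file sits next to the DAG in the dags/ folder (the file and module names are made up):

# dags/regional_dataflow.py (hypothetical module) holds the backported classes.
from airflow.contrib.hooks.gcp_dataflow_hook import DataFlowHook
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator

class RegionalDataFlowHook(DataFlowHook):
    ...  # _start_template_dataflow override shown above

class RegionalDataflowTemplateOperator(DataflowTemplateOperator):
    ...  # execute override shown above

# The DAG file then only needs:
# from regional_dataflow import RegionalDataflowTemplateOperator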
