
"Google Cloud Dataflow Python,检索作业 ID"

[英]Google Cloud Dataflow Python, Retrieving Job ID

I am currently working on a Dataflow Template in Python, and I would like to access the Job ID and use it to save to a specific Firestore Document.

Is it possible to access the Job ID?

I cannot find anything regarding this in the documentation.

"

You can do so by calling dataflow.projects().locations().jobs().list from within the pipeline (see full code below). One possibility is to always invoke the template with the same job name, which would make sense; otherwise, the job prefix could be passed as a runtime parameter. The list of jobs is parsed with a regex to see whether each job name contains the prefix and, if so, its job ID is returned. If there is more than one match, only the latest one (the one currently running) is returned.

The template is staged, after defining the PROJECT and BUCKET variables, with:

python script.py \
    --runner DataflowRunner \
    --project $PROJECT \
    --staging_location gs://$BUCKET/staging \
    --temp_location gs://$BUCKET/temp \
    --template_location gs://$BUCKET/templates/retrieve_job_id

Then, specify the desired job name (myjobprefix in my case) when executing the templated job:

gcloud dataflow jobs run myjobprefix \
   --gcs-location gs://$BUCKET/templates/retrieve_job_id

The retrieve_job_id function will return the job ID from within the job; change job_prefix to match the name you give the job.

import argparse, logging, re
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


def retrieve_job_id(element):
  project = 'PROJECT_ID'
  job_prefix = "myjobprefix"
  location = 'us-central1'

  logging.info("Looking for jobs with prefix {} in region {}...".format(job_prefix, location))

  try:
    credentials = GoogleCredentials.get_application_default()
    dataflow = build('dataflow', 'v1b3', credentials=credentials)

    result = dataflow.projects().locations().jobs().list(
      projectId=project,
      location=location,
    ).execute()

    job_id = "none"

    for job in result['jobs']:
      if re.search(re.escape(job_prefix), job['name']):
        job_id = job['id']
        break

    logging.info("Job ID: {}".format(job_id))
    return job_id

  except Exception as e:
    logging.info("Error retrieving Job ID")
    raise KeyError(e)


def run(argv=None):
  parser = argparse.ArgumentParser()
  known_args, pipeline_args = parser.parse_known_args(argv)

  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = True

  p = beam.Pipeline(options=pipeline_options)

  init_data = (p
               | 'Start' >> beam.Create(["Init pipeline"])
               | 'Retrieve Job ID' >> beam.Map(retrieve_job_id))

  p.run()


if __name__ == '__main__':
  run()

You can use the Google Dataflow API. Use the projects.jobs.list method to retrieve Dataflow Job IDs.

From skimming over the documentation, the response you get from launching the job should contain a JSON body with a property "job" that is an instance of Job.

You should be able to use this to get the ID you need.

If you are using the Google Cloud SDK for Dataflow, you might get a different object when you call the create method on templates().
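A minimal sketch of the projects.jobs.list approach suggested above, assuming application-default credentials and a placeholder project ID:

import googleapiclient.discovery
from oauth2client.client import GoogleCredentials

PROJECT_ID = 'your-project-id'  # placeholder, substitute your own project

credentials = GoogleCredentials.get_application_default()
dataflow = googleapiclient.discovery.build('dataflow', 'v1b3', credentials=credentials)

# projects.jobs.list returns the jobs in the project; each entry carries
# the job ID, name and current state.
response = dataflow.projects().jobs().list(projectId=PROJECT_ID).execute()
for job in response.get('jobs', []):
    print(job['id'], job['name'], job['currentState'])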

The following snippet launches a Dataflow template stored in a GCS bucket, gets the job ID from the response body of the launch template API, and then polls the final job state of the Dataflow job every 10 seconds.

The official documentation by Google Cloud for the response body is here.

So far I have only seen six job states of a Dataflow job; please let me know if I have missed any.

import logging
from time import sleep

import googleapiclient.discovery

logger = logging.getLogger(__name__)


def launch_dataflow_template(project_id, location, credentials, template_path):
    dataflow = googleapiclient.discovery.build('dataflow', 'v1b3', credentials=credentials)
    logger.info(f"Template path: {template_path}")
    result = dataflow.projects().locations().templates().launch(
            projectId=project_id,
            location=location,
            body={
                ...
            },
            gcsPath=template_path  # dataflow template path
    ).execute()
    return result.get('job', {}).get('id')

def poll_dataflow_job_status(project_id, location, credentials, job_id):
    dataflow = googleapiclient.discovery.build('dataflow', 'v1b3', credentials=credentials)
    # executing states are not the final states of a Dataflow job, they show that the Job is transitioning into another upcoming state
    executing_states = ['JOB_STATE_PENDING', 'JOB_STATE_RUNNING', 'JOB_STATE_CANCELLING']
    # final states do not change further
    final_states = ['JOB_STATE_DONE', 'JOB_STATE_FAILED', 'JOB_STATE_CANCELLED']
    while True:
        job_desc = _get_dataflow_job_status(dataflow, project_id, location, job_id)
        if job_desc['currentState'] in executing_states:
            pass
        elif job_desc['currentState'] in final_states:
            break
        sleep(10)
    return job_id, job_desc['currentState']
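
The _get_dataflow_job_status helper is not shown above; a minimal sketch of what it could look like, assuming it simply wraps the jobs.get method of the same API client:

def _get_dataflow_job_status(dataflow, project_id, location, job_id):
    # Assumed helper: fetch the job description (including currentState)
    # with the Dataflow v1b3 projects.locations.jobs.get method.
    return dataflow.projects().locations().jobs().get(
        projectId=project_id,
        location=location,
        jobId=job_id
    ).execute()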

You can get GCP metadata by using these Beam functions in 2.35.0. You can visit the documentation: https://beam.apache.org/releases/pydoc/2.35.0/_modules/apache_beam/io/gcp/gce_metadata_util.html#fetch_dataflow_job_id

beam.io.gcp.gce_metadata_util._fetch_custom_gce_metadata("job_name")
beam.io.gcp.gce_metadata_util._fetch_custom_gce_metadata("job_id")
