Google Cloud Dataflow Python, Retrieving Job ID
I am currently writing a Dataflow template in Python, and I would like to access the job ID and use it to save to a specific Firestore document.
Is it possible to access the job ID?
I cannot find anything about this in the documentation.
You can achieve this by calling dataflow.projects().locations().jobs().list from within the pipeline (see full code below). One possibility is to always invoke the template with the same job name, which would make sense; otherwise, the job prefix can be passed as a runtime parameter. The list of jobs is parsed with a regex to see whether a job's name contains the prefix and, if so, that job's ID is returned. If there is more than one match, only the latest one (the currently running one) is returned.
After defining the PROJECT and BUCKET variables, stage the template with:
python script.py \
--runner DataflowRunner \
--project $PROJECT \
--staging_location gs://$BUCKET/staging \
--temp_location gs://$BUCKET/temp \
--template_location gs://$BUCKET/templates/retrieve_job_id
Then, specify the desired job name (myjobprefix in my case) when executing the templated job:
gcloud dataflow jobs run myjobprefix \
--gcs-location gs://$BUCKET/templates/retrieve_job_id
The retrieve_job_id function returns the job ID; change job_prefix to match the name you gave the job.
import argparse, logging, re

from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


def retrieve_job_id(element):
    project = 'PROJECT_ID'
    job_prefix = "myjobprefix"
    location = 'us-central1'

    logging.info("Looking for jobs with prefix {} in region {}...".format(job_prefix, location))

    try:
        credentials = GoogleCredentials.get_application_default()
        dataflow = build('dataflow', 'v1b3', credentials=credentials)

        result = dataflow.projects().locations().jobs().list(
            projectId=project,
            location=location,
        ).execute()

        job_id = "none"

        # jobs are returned newest-first, so the first match is the latest one
        for job in result['jobs']:
            if re.findall(r'' + re.escape(job_prefix) + '', job['name']):
                job_id = job['id']
                break

        logging.info("Job ID: {}".format(job_id))
        return job_id

    except Exception as e:
        logging.info("Error retrieving Job ID")
        raise KeyError(e)


def run(argv=None):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    p = beam.Pipeline(options=pipeline_options)

    init_data = (p
                 | 'Start' >> beam.Create(["Init pipeline"])
                 | 'Retrieve Job ID' >> beam.FlatMap(retrieve_job_id))

    p.run()


if __name__ == '__main__':
    run()
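The prefix-matching loop can be exercised locally against a stubbed job list. This is a hypothetical extraction (match_job_id is not part of the answer above), assuming jobs().list returns jobs newest-first as dicts with 'name' and 'id' keys:

```python
import re

def match_job_id(jobs, job_prefix):
    # jobs is assumed to be newest-first, as returned by jobs().list();
    # return the ID of the most recent job whose name contains the prefix
    job_id = "none"
    for job in jobs:
        if re.findall(re.escape(job_prefix), job['name']):
            job_id = job['id']
            break
    return job_id

stub_jobs = [
    {'name': 'myjobprefix-20200101', 'id': 'id-new'},
    {'name': 'myjobprefix-20190101', 'id': 'id-old'},
    {'name': 'otherjob', 'id': 'id-other'},
]
print(match_job_id(stub_jobs, 'myjobprefix'))  # prints 'id-new'
```

Note that when no job matches, the function falls through to the "none" sentinel, mirroring the behaviour of retrieve_job_id above.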
You can use the Google Dataflow API. Use the projects.jobs.list method to retrieve Dataflow job IDs.
For example, the following code snippet launches a Dataflow template stored in a GCS bucket, gets the job ID from the response body of the launch-template API, and finally polls the Dataflow job every 10 seconds for its final job state.
The official Google Cloud documentation of the response body is here.
So far I have only seen six job states for a Dataflow job; please let me know if I have missed the others.
import logging
from time import sleep

import googleapiclient.discovery

logger = logging.getLogger(__name__)


def launch_dataflow_template(project_id, location, credentials, template_path):
    dataflow = googleapiclient.discovery.build('dataflow', 'v1b3', credentials=credentials)
    logger.info(f"Template path: {template_path}")
    result = dataflow.projects().locations().templates().launch(
        projectId=project_id,
        location=location,
        body={
            ...
        },
        gcsPath=template_path  # dataflow template path
    ).execute()
    # the launch response body contains the created Job resource
    return result.get('job', {}).get('id')


def poll_dataflow_job_status(project_id, location, credentials, job_id):
    dataflow = googleapiclient.discovery.build('dataflow', 'v1b3', credentials=credentials)
    # executing states are not the final states of a Dataflow job; they show
    # that the job is transitioning into another upcoming state
    executing_states = ['JOB_STATE_PENDING', 'JOB_STATE_RUNNING', 'JOB_STATE_CANCELLING']
    # final states do not change further
    final_states = ['JOB_STATE_DONE', 'JOB_STATE_FAILED', 'JOB_STATE_CANCELLED']
    while True:
        job_desc = _get_dataflow_job_status(dataflow, project_id, location, job_id)
        if job_desc['currentState'] in executing_states:
            pass
        elif job_desc['currentState'] in final_states:
            break
        sleep(10)
    return job_id, job_desc['currentState']
You can get GCP metadata by using these Beam functions in 2.35.0. See the documentation at https://beam.apache.org/releases/pydoc/2.35.0/_modules/apache_beam/io/gcp/gce_metadata_util.html#fetch_dataflow_job_id
beam.io.gcp.gce_metadata_util._fetch_custom_gce_metadata("job_name")
beam.io.gcp.gce_metadata_util._fetch_custom_gce_metadata("job_id")