Scheduling a Google Batch job to trigger on GCP (long running Python script)
By "Google Batch" I'm referring to the new service Google launched about a month or so ago.
https://cloud.google.com/batch
I have a Python script which takes a few minutes to execute at the moment. However, with the data it will be processing in the next few months, this execution time will grow from minutes to hours. This is why I am not using Cloud Functions or Cloud Run to run this script; both have a maximum 60-minute execution time.
Google Batch came about recently, and I wanted to explore it as a possible way to achieve what I'm looking for without just using Compute Engine.
However, documentation is sparse across the internet, and I can't find a way to "trigger" an already created Batch job using Cloud Scheduler. I've already successfully created a Batch job manually which runs my Docker image. Now I need something to trigger this batch job once a day, that's it. It would be wonderful if Cloud Scheduler could serve this purpose.
I've seen one article describing using GCP Workflows to create a new Batch job on a cron schedule determined by Cloud Scheduler. The issue with this is that it creates a new batch job every time, rather than simply re-running the existing one. To be honest, I can't even re-run an already executed batch job from the GCP console itself, so I don't know if it's even possible.
https://www.intertec.io/resource/python-script-on-gcp-batch
Lastly, I've even explored the official Google Batch Python library and could not find any built-in function which allows me to "call" a previously created batch job and just re-run it.
https://github.com/googleapis/python-batch
There is a misunderstanding. When you use Cloud Run jobs, you create a configuration and then you execute that configuration.
BUT, with a Batch job, you execute a configuration directly. That's all; there is no configuration to create in advance.
Have a look at the job APIs: Create, Get, Delete. No more; there is no "run" or "execute" method.
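What this implies (a sketch with plain dicts and no API calls, so verify the field names against the current Batch Job reference): "re-running" a Batch job amounts to fetching an existing job's configuration, dropping the output-only fields, and creating a brand-new job from what remains.

```python
# Sketch of the "re-run" pattern: reuse an existing job's spec as a new job.
def rerun_config(existing_job: dict) -> dict:
    # These fields are set by the service on creation and cannot be reused
    # (assumed list; check the Batch Job proto for the authoritative set).
    OUTPUT_ONLY = ("name", "uid", "status", "createTime", "updateTime")
    return {k: v for k, v in existing_job.items() if k not in OUTPUT_ONLY}

old = {
    "name": "projects/p/locations/r/jobs/j",
    "uid": "abc",
    "status": {"state": "SUCCEEDED"},
    "taskGroups": [{"taskCount": 1}],
}
print(rerun_config(old))  # only the reusable spec remains
```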
Therefore, you have to put the whole Batch job configuration in your Cloud Scheduler job, so that it creates a new Batch job on each run. Take care NOT to set the jobID in the query parameter.
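To make that concrete (a minimal sketch using plain dicts and placeholder project/region values instead of the proto types), the Scheduler job's body is simply the JSON-serialized Batch job configuration, encoded to bytes, POSTed to the Batch "create job" endpoint:

```python
import json

# Placeholder values; substitute your own project and region.
project = "my-project"
location = "us-west1"

# A plain-dict sketch of a minimal Batch job configuration.
batch_job = {
    "taskGroups": [{
        "taskSpec": {
            "runnables": [{
                "container": {
                    "imageUri": "gcr.io/google-containers/busybox",
                    "entrypoint": "/bin/sh",
                    "commands": ["-c", "echo Hello"],
                },
            }],
        },
        "taskCount": 1,
    }],
    "logsPolicy": {"destination": "CLOUD_LOGGING"},
}

# Cloud Scheduler POSTs this body to the Batch "create job" endpoint.
uri = f"https://batch.googleapis.com/v1/projects/{project}/locations/{location}/jobs"
body = json.dumps(batch_job).encode("utf-8")
print(uri)
```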
I wrote this for you this morning as a guide.
It uses Google's example in combination with Cloud Scheduler:
# Used to correctly (!?) form Batch Job
import google.cloud.batch_v1.types
import google.cloud.scheduler_v1
import google.cloud.scheduler_v1.types
import os

project = os.getenv("PROJECT")
number = os.getenv("NUMBER")
location = os.getenv("LOCATION")
job = os.getenv("JOB")

# Batch Job
# Create Batch Job using batch_v1.types
# Alternatively, create this from scratch
batch_job = google.cloud.batch_v1.types.Job(
    priority=0,
    task_groups=[
        google.cloud.batch_v1.types.TaskGroup(
            task_spec=google.cloud.batch_v1.types.TaskSpec(
                runnables=[
                    google.cloud.batch_v1.types.Runnable(
                        container=google.cloud.batch_v1.types.Runnable.Container(
                            image_uri="gcr.io/google-containers/busybox",
                            entrypoint="/bin/sh",
                            commands=[
                                "-c",
                                "echo \"Hello world! This is task ${BATCH_TASK_INDEX}. This job has a total of ${BATCH_TASK_COUNT} tasks.\"",
                            ],
                        ),
                    ),
                ],
                compute_resource=google.cloud.batch_v1.types.ComputeResource(
                    cpu_milli=2000,
                    memory_mib=16,
                ),
            ),
            task_count=1,
            parallelism=1,
        ),
    ],
    allocation_policy=google.cloud.batch_v1.types.AllocationPolicy(
        location=google.cloud.batch_v1.types.AllocationPolicy.LocationPolicy(
            allowed_locations=[
                f"regions/{location}",
            ],
        ),
        instances=[
            google.cloud.batch_v1.types.AllocationPolicy.InstancePolicyOrTemplate(
                policy=google.cloud.batch_v1.types.AllocationPolicy.InstancePolicy(
                    machine_type="e2-standard-2",
                ),
            ),
        ],
    ),
    labels={
        "stackoverflow": "73966292",
    },
    logs_policy=google.cloud.batch_v1.types.LogsPolicy(
        destination=google.cloud.batch_v1.types.LogsPolicy.Destination.CLOUD_LOGGING,
    ),
)

# Convert the Google Batch Job into JSON
# Google uses Proto Python
# https://proto-plus-python.readthedocs.io/en/stable/messages.html?highlight=JSON#serialization
batch_json = google.cloud.batch_v1.types.Job.to_json(batch_job)
print(batch_json)

# Convert JSON to bytes as required for body by Cloud Scheduler
body = batch_json.encode("utf-8")

# Run hourly on the hour (HH:00)
schedule = "0 * * * *"

parent = f"projects/{project}/locations/{location}"
name = f"{parent}/jobs/{job}"
uri = f"https://batch.googleapis.com/v1/{parent}/jobs?job_id={job}"
service_account_email = f"{number}-compute@developer.gserviceaccount.com"

scheduler_job = google.cloud.scheduler_v1.types.Job(
    name=name,
    description="description",
    http_target=google.cloud.scheduler_v1.types.HttpTarget(
        uri=uri,
        http_method=google.cloud.scheduler_v1.types.HttpMethod.POST,
        oauth_token=google.cloud.scheduler_v1.types.OAuthToken(
            service_account_email=service_account_email,
        ),
        body=body,
    ),
    schedule=schedule,
)

scheduler_json = google.cloud.scheduler_v1.Job.to_json(scheduler_job)
print(scheduler_json)

request = google.cloud.scheduler_v1.CreateJobRequest(
    parent=parent,
    job=scheduler_job,
)

scheduler_client = google.cloud.scheduler_v1.CloudSchedulerClient()
print(
    scheduler_client.create_job(
        request=request,
    )
)
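One caveat (my reading of the Batch API, so verify for your case): job IDs must be unique within a project and region, so a fixed `job_id` query parameter in a static Scheduler URI would be rejected on the second scheduled run. The first answer above advises omitting the jobID entirely; if you do need to supply one (e.g. from a small intermediary between Scheduler and Batch), a timestamp suffix keeps it unique and within the assumed format (lowercase letters, digits, hyphens, at most 63 characters):

```python
import datetime
import re

def unique_job_id(prefix: str) -> str:
    # A UTC timestamp suffix keeps daily runs unique; the regex enforces the
    # assumed job-ID format (starts with a letter, max 63 characters).
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d-%H%M%S")
    job_id = f"{prefix}-{stamp}"
    assert re.fullmatch(r"[a-z][a-z0-9-]{0,62}", job_id), job_id
    return job_id

print(unique_job_id("tester"))  # e.g. tester-20221005-090000
```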
You can test using:
BILLING="..."
PROJECT="..."
LOCATION="..." # E.g. us-west1
JOB="tester"
ACCOUNT="tester"
EMAIL="${ACCOUNT}@${PROJECT}.iam.gserviceaccount.com"
# Create Project and enable Billing
gcloud projects create ${PROJECT}
gcloud beta billing projects link ${PROJECT} \
--billing-account=${BILLING}
# Enable Batch, Cloud Scheduler and Compute Engine
SERVICES=(
"batch"
"cloudscheduler"
"compute"
)
for SERVICE in ${SERVICES[@]}
do
gcloud services enable ${SERVICE}.googleapis.com \
--project=${PROJECT}
done
# Create Service Account
gcloud iam service-accounts create ${ACCOUNT} \
--project=${PROJECT}
gcloud iam service-accounts keys create ${PWD}/${ACCOUNT}.json \
--iam-account=${EMAIL} \
--project=${PROJECT}
# IAM
# https://cloud.google.com/iam/docs/understanding-roles#cloud-scheduler-roles
ROLES=(
"roles/batch.jobsEditor"
"roles/cloudscheduler.admin"
)
for ROLE in ${ROLES[@]}
do
gcloud projects add-iam-policy-binding ${PROJECT} \
--member=serviceAccount:${EMAIL} \
--role=${ROLE}
done
# ActAs
NUMBER=$(\
gcloud projects describe ${PROJECT} \
--format="value(projectNumber)")
COMPUTE_ENGINE="${NUMBER}-compute@developer.gserviceaccount.com"
gcloud iam service-accounts add-iam-policy-binding ${COMPUTE_ENGINE} \
--member=serviceAccount:${EMAIL} \
--role="roles/iam.serviceAccountUser" \
--project=${PROJECT}
Then:
python3 -m venv venv
source venv/bin/activate
# Or requirements.txt
python3 -m pip install google-cloud-batch
python3 -m pip install google-cloud-scheduler
export JOB
export LOCATION
export NUMBER
export PROJECT
export GOOGLE_APPLICATION_CREDENTIALS=${PWD}/${ACCOUNT}.json
python3 main.py
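Note that `main.py` reads its configuration via `os.getenv`, which silently returns `None` for any variable you forgot to `export`; the failure then only surfaces later as a confusing API error. A small guard (a sketch, not part of the original script) can fail fast instead:

```python
import os

REQUIRED = ("PROJECT", "NUMBER", "LOCATION", "JOB")

def check_env(names=REQUIRED):
    # Collect every unset/empty variable and abort with a clear message.
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}

# Simulated environment for illustration; in main.py these come from `export`.
os.environ.update(PROJECT="p", NUMBER="1", LOCATION="us-west1", JOB="tester")
print(check_env())
```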