
Scheduling a Google Batch job to trigger on GCP (long running Python script)

By "Google Batch" I'm referring to the new service Google launched about a month or so ago.

https://cloud.google.com/batch

I have a Python script which takes a few minutes to execute at the moment. However, with the data it will be processing in the next few months, this execution time will go from minutes to hours. This is why I am not using Cloud Functions or Cloud Run to run this script; both have a maximum execution time of 60 minutes.

Google Batch came about recently, and I wanted to explore it as a possible way to achieve what I'm looking for without just using Compute Engine.

However, documentation is sparse across the internet, and I can't find a method to "trigger" an already created Batch job using Cloud Scheduler. I've already successfully created a batch job manually which runs my docker image. Now I need something to trigger this batch job once a day, that's it. It would be wonderful if Cloud Scheduler could serve this purpose.

I've seen one article describing using GCP Workflows to create a new Batch job on a cron schedule determined by Cloud Scheduler. The issue with this is that it creates a new batch job every time, rather than simply re-running the already existing one. To be honest, I can't even re-run an already executed batch job on the GCP website itself, so I don't know if it's even possible.

https://www.intertec.io/resource/python-script-on-gcp-batch

Lastly, I've even explored the official Google Batch Python library and could not find a built-in function anywhere in there which allows me to "call" a previously created batch job and just re-run it.

https://github.com/googleapis/python-batch

There is a misunderstanding. When you use Cloud Run jobs, you create a configuration and then you execute that configuration.

BUT, with a Batch job, you execute a configuration. That's all; there is no configuration to create in advance.

Have a look at the APIs: Create, Get, Delete. No more.
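Since the API surface stops at Create/Get/Delete, the only way to "re-run" an existing job is to create a new one from the same spec under a fresh job ID. A minimal sketch of that pattern (the helper names `make_job_id` and `submit_copy` are my own; it assumes the `google-cloud-batch` package and application-default credentials are set up):

```python
import datetime


def make_job_id(prefix: str, now: datetime.datetime) -> str:
    # Batch job IDs must be unique within a project/location,
    # so suffix a fixed prefix with a UTC timestamp.
    return f"{prefix}-{now:%Y%m%d-%H%M%S}"


def submit_copy(project: str, location: str, job_spec):
    # Create a brand-new Batch job from an existing Job spec;
    # every call submits under a distinct job_id.
    from google.cloud import batch_v1  # requires google-cloud-batch

    client = batch_v1.BatchServiceClient()
    return client.create_job(
        parent=f"projects/{project}/locations/{location}",
        job=job_spec,
        job_id=make_job_id("rerun", datetime.datetime.utcnow()),
    )
```

Calling `submit_copy` once a day from any trigger gives the "re-run" behaviour the question asks for, at the cost of accumulating job resources (old ones can be removed with Delete).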

Therefore, you have to put the whole Batch job configuration in your Cloud Scheduler job, so that each trigger creates a new Batch job. Take care to NOT set the jobID in the query parameter.

I wrote this for you this morning as a guide.

It uses Google's example in combination with Cloud Scheduler:

# Used to correctly (!?) form Batch Job
import google.cloud.batch_v1.types

import google.cloud.scheduler_v1
import google.cloud.scheduler_v1.types

import os


project = os.getenv("PROJECT")
number = os.getenv("NUMBER")
location = os.getenv("LOCATION")
job = os.getenv("JOB")

# Batch Job
# Create Batch Job using batch_v1.types
# Alternatively, create this from scratch
batch_job = google.cloud.batch_v1.types.Job(
    priority=0,
    task_groups=[
        google.cloud.batch_v1.types.TaskGroup(
            task_spec=google.cloud.batch_v1.types.TaskSpec(
                runnables=[
                    google.cloud.batch_v1.types.Runnable(
                        container=google.cloud.batch_v1.types.Runnable.Container(
                            image_uri="gcr.io/google-containers/busybox",
                            entrypoint="/bin/sh",
                            commands=[
                                "-c",
                                "echo \"Hello world! This is task ${BATCH_TASK_INDEX}. This job has a total of ${BATCH_TASK_COUNT} tasks.\""
                            ],
                        ),
                    ),
                ],
                compute_resource=google.cloud.batch_v1.types.ComputeResource(
                    cpu_milli=2000,
                    memory_mib=16,
                )
            ),
            task_count=1,
            parallelism=1,
        ),
    ],
    allocation_policy=google.cloud.batch_v1.types.AllocationPolicy(
        location=google.cloud.batch_v1.types.AllocationPolicy.LocationPolicy(
           allowed_locations=[
            f"regions/{location}",
           ], 
        ),
        instances=[
            google.cloud.batch_v1.types.AllocationPolicy.InstancePolicyOrTemplate(
                policy=google.cloud.batch_v1.types.AllocationPolicy.InstancePolicy(
                    machine_type="e2-standard-2",
                ),
            ),
        ],
    ),
    labels={
        "stackoverflow":"73966292",
    },
    logs_policy=google.cloud.batch_v1.types.LogsPolicy(
        destination=google.cloud.batch_v1.types.LogsPolicy.Destination.CLOUD_LOGGING,
    ),
)

# Convert the Google Batch Job into JSON
# Google uses Proto Python
# https://proto-plus-python.readthedocs.io/en/stable/messages.html?highlight=JSON#serialization
batch_json=google.cloud.batch_v1.types.Job.to_json(batch_job)
print(batch_json)

# Convert JSON to bytes as required for body by Cloud Scheduler
body=batch_json.encode("utf-8")

# Run hourly on the hour (HH:00); for once a day use e.g. "0 9 * * *"
schedule = "0 * * * *"

parent = f"projects/{project}/locations/{location}"
name = f"{parent}/jobs/{job}"
# NB: with a fixed job_id here, the second trigger will fail with
# ALREADY_EXISTS; drop the query parameter to create a new job each run.
uri = f"https://batch.googleapis.com/v1/{parent}/jobs?job_id={job}"

service_account_email = f"{number}-compute@developer.gserviceaccount.com"

scheduler_job = google.cloud.scheduler_v1.types.Job(
    name=name,
    description="description",
    http_target=google.cloud.scheduler_v1.types.HttpTarget(
        uri=uri,
        http_method=google.cloud.scheduler_v1.types.HttpMethod(
            google.cloud.scheduler_v1.types.HttpMethod.POST,
        ),
        oauth_token=google.cloud.scheduler_v1.types.OAuthToken(
            service_account_email=service_account_email,
        ),
        body=body,
    ),
    schedule=schedule,
)

scheduler_json = google.cloud.scheduler_v1.Job.to_json(scheduler_job)
print(scheduler_json)

request = google.cloud.scheduler_v1.CreateJobRequest(
    parent=parent,
    job=scheduler_job,
)

scheduler_client = google.cloud.scheduler_v1.CloudSchedulerClient()
print(
    scheduler_client.create_job(
        request=request
    )
)
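To check the trigger without waiting for the next cron tick, Cloud Scheduler's RunJob RPC forces an immediate execution. A sketch under the same assumptions (helper names are mine; requires the `google-cloud-scheduler` package):

```python
def scheduler_job_name(project: str, location: str, job: str) -> str:
    # Fully qualified Cloud Scheduler job resource name.
    return f"projects/{project}/locations/{location}/jobs/{job}"


def force_run(project: str, location: str, job: str):
    # Trigger the scheduler job now instead of waiting for its schedule.
    from google.cloud import scheduler_v1  # requires google-cloud-scheduler

    client = scheduler_v1.CloudSchedulerClient()
    return client.run_job(name=scheduler_job_name(project, location, job))
```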

You can test using:

BILLING="..."
PROJECT="..."
LOCATION="..." # E.g. us-west1

JOB="tester"

ACCOUNT="tester"
EMAIL="${ACCOUNT}@${PROJECT}.iam.gserviceaccount.com"

# Create Project and enable Billing
gcloud projects create ${PROJECT}
gcloud beta billing projects link ${PROJECT} \
--billing-account=${BILLING}

# Enable Batch, Cloud Scheduler and Compute Engine
SERVICES=(
  "batch"
  "cloudscheduler"
  "compute"
)
for SERVICE in ${SERVICES[@]}
do
  gcloud services enable ${SERVICE}.googleapis.com \
  --project=${PROJECT}
done

# Create Service Account
gcloud iam service-accounts create ${ACCOUNT} \
--project=${PROJECT}

gcloud iam service-accounts keys create ${PWD}/${ACCOUNT}.json \
--iam-account=${EMAIL} \
--project=${PROJECT}

# IAM
# https://cloud.google.com/iam/docs/understanding-roles#cloud-scheduler-roles
ROLES=(
  "roles/batch.jobsEditor"
  "roles/cloudscheduler.admin"
)
for ROLE in ${ROLES[@]}
do
  gcloud projects add-iam-policy-binding ${PROJECT} \
  --member=serviceAccount:${EMAIL} \
  --role=${ROLE}
done

# ActAs
NUMBER=$(\
  gcloud projects describe ${PROJECT} \
  --format="value(projectNumber)")
COMPUTE_ENGINE="${NUMBER}-compute@developer.gserviceaccount.com"
gcloud iam service-accounts add-iam-policy-binding ${COMPUTE_ENGINE} \
--member=serviceAccount:${EMAIL} \
--role="roles/iam.serviceAccountUser" \
--project=${PROJECT}

Then:

python3 -m venv venv
source venv/bin/activate

# Or requirements.txt
python3 -m pip install google-cloud-batch
python3 -m pip install google-cloud-scheduler

export JOB
export LOCATION
export NUMBER
export PROJECT

export GOOGLE_APPLICATION_CREDENTIALS=${PWD}/${ACCOUNT}.json

python3 main.py
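Once the scheduler has fired at least once, you can confirm that new Batch jobs are appearing. A sketch (the `list_batch_jobs` helper is my own; it assumes `google-cloud-batch` and the same PROJECT/LOCATION environment variables as above):

```python
import os


def batch_parent(project: str, location: str) -> str:
    # Parent resource under which Batch jobs are created and listed.
    return f"projects/{project}/locations/{location}"


def list_batch_jobs():
    # Print every Batch job in the project/location with its current state.
    from google.cloud import batch_v1  # requires google-cloud-batch

    client = batch_v1.BatchServiceClient()
    parent = batch_parent(os.environ["PROJECT"], os.environ["LOCATION"])
    for job in client.list_jobs(parent=parent):
        print(job.name, job.status.state.name)
```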
