
Google PubSub Push Subscription Timeouts from GCS Bucket upload call in Cloud Run (Python)

I'm trying to develop an image processing pipeline that takes videos uploaded to one GCS bucket, extracts all the frames as JPEG images, and uploads these images to another GCS bucket. I'm using a Pub/Sub push subscription to trigger the Cloud Run service. Unfortunately, the service cannot reliably process the videos within the 10-minute maximum request/response timeout for push subscriptions. I've tracked the issue down, and it appears that uploading the frames to GCS is the bottleneck. The videos contain, on average, about 28,000 frames (30 FPS, ~15 minutes in length). I think this should be possible in the time provided. All services are in the same region/zone.

Is there a way to increase the throughput of these GCS blob uploads? When I use gsutil to copy a video blob from one bucket to another (within the same region), it takes seconds.

I've tried increasing/decreasing the thread count, increasing the service's CPU count, and increasing the service's memory, and I don't see any change. GCS rate limits writes above roughly 1000/sec, but I don't think I'm anywhere near that limit yet.
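For reference, the thread-count changes were just different pool sizes for the executor that performs the uploads, along these lines (the max_workers values here are examples, not the exact ones tried):

from concurrent.futures import ThreadPoolExecutor

# Only the size of the upload pool changed between runs; Python derives the
# default from the CPU count when max_workers is omitted.
executor = ThreadPoolExecutor(max_workers=8)
# executor = ThreadPoolExecutor(max_workers=32)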

My service copies the main.py script from the Google Cloud Run Vision Tutorial; the only modification is changing the call so it invokes my processing routine in video.py, which does the actual processing. Both files are included at the bottom of the post.

Cloud Run service: 1 CPU, 512 MiB memory, 15 min timeout
Cloud Pub/Sub push subscription: 10 min timeout (the maximum)

video.py:

import os
from datetime import timedelta
from concurrent.futures import ThreadPoolExecutor

import cv2

from google.cloud import storage
from google.oauth2 import service_account


def upload(blob: storage.blob.Blob, buf: "numpy.ndarray"):
    # Upload one JPEG-encoded frame (an in-memory buffer) to its destination blob.
    blob.upload_from_string(buf.tobytes(), content_type="image/jpeg")


def process(data: dict):
    # "data" is the decoded Cloud Storage notification, carrying the source
    # bucket and object name of the uploaded video.
    src_client = storage.Client()
    src_bucket = src_client.get_bucket(data["bucket"])
    src_blob = src_bucket.get_blob(data["name"])

    pathname = os.path.dirname(data["name"])
    basename, ext = os.path.splitext(os.path.basename(data["name"]))

    # Generate a V4 signed URL so OpenCV can read the video over HTTP
    # without downloading the whole object first.
    signing_creds = \
        service_account.Credentials.from_service_account_file("key.json")

    url = src_blob.generate_signed_url(
            credentials=signing_creds,
            version="v4",
            expiration=timedelta(minutes=20),
            method="GET"
        )

    count = extract_frames(url, basename, pathname)


def extract_frames(
        signed_url: str,
        basename: str,
        pathname: str,
        dst_bucket_name: str = "extracted-frames"
    ) -> int:

    # Destination bucket that receives the extracted JPEG frames.
    dst_client = storage.Client()
    dst_bucket = dst_client.get_bucket(dst_bucket_name)

    count = 0
    vid = cv2.VideoCapture(signed_url)

    with ThreadPoolExecutor() as executor:
        ret, frame = vid.read()

        while ret:
            # Encode the frame to JPEG in memory.
            enc_ret, buf = cv2.imencode(".jpg", frame)

            if not enc_ret:
                msg = f'Bad Encoding [Frame: {count:06}]'
            else:
                blob_name = f"{pathname}/{basename}-{count:06}.jpg"
                blob = dst_bucket.blob(blob_name)
                # Hand the upload to the thread pool; submit() passes both
                # arguments to upload() (executor.map(upload, (blob, buf))
                # would instead call upload(blob) and upload(buf) separately).
                executor.submit(upload, blob, buf)

            count += 1
            ret, frame = vid.read()

    vid.release()
    return count

main.py:

import base64
import json
import os

from flask import Flask, request

# import image
import video


app = Flask(__name__)


@app.route("/", methods=["POST"])
def index():
    envelope = request.get_json()
    if not envelope:
        msg = "no Pub/Sub message received"
        print(f"error: {msg}")
        return f"Bad Request: {msg}", 400

    if not isinstance(envelope, dict) or "message" not in envelope:
        msg = "invalid Pub/Sub message format"
        print(f"error: {msg}")
        return f"Bad Request: {msg}", 400

    # Decode the Pub/Sub message.
    pubsub_message = envelope["message"]

    if isinstance(pubsub_message, dict) and "data" in pubsub_message:
        try:
            data = json.loads(base64.b64decode(pubsub_message["data"]).decode())

        except Exception as e:
            msg = (
                "Invalid Pub/Sub message: "
                "data property is not valid base64 encoded JSON"
            )
            print(f"error: {e}")
            return f"Bad Request: {msg}", 400

        # Validate the message is a Cloud Storage event.
        if not data["name"] or not data["bucket"]:
            msg = (
                "Invalid Cloud Storage notification: "
                "expected name and bucket properties"
            )
            print(f"error: {msg}")
            return f"Bad Request: {msg}", 400

        try:
            # image.blur_offensive_images(data)
            video.process(data)
            return ("", 204)

        except Exception as e:
            print(f"error: {e}")
            return ("", 500)

    return ("", 500)

The pattern in your case, to stay scalable if you have longer videos in the future, is to start by splitting the video into smaller sequences (say 3 or 5 minutes of video each) and to store these sequences in Cloud Storage.

Then each new object triggers a new run of the service (or of the same one, depending on your design), and you extract all the images from that chunk. If you need a single logical source, you can name your chunk files with the same prefix so they can be grouped again in the subsequent processing steps.
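A minimal sketch of that splitting step, assuming ffmpeg is available in the container and a hypothetical chunk bucket named video-chunks; the 180-second segment length and the naming scheme are only illustrative:

import os
import subprocess
import tempfile

from google.cloud import storage


def split_and_upload(local_video: str,
                     chunk_bucket_name: str = "video-chunks",
                     segment_seconds: int = 180) -> list:
    # Split a local video into fixed-length chunks and upload each chunk,
    # sharing a common prefix so later steps can group them again.
    client = storage.Client()
    bucket = client.get_bucket(chunk_bucket_name)

    prefix = os.path.splitext(os.path.basename(local_video))[0]
    out_dir = tempfile.mkdtemp()
    pattern = os.path.join(out_dir, f"{prefix}-%03d.mp4")

    # Stream-copy into N-second segments: no re-encoding, so this is fast.
    subprocess.run(
        ["ffmpeg", "-i", local_video, "-c", "copy", "-map", "0",
         "-f", "segment", "-segment_time", str(segment_seconds),
         "-reset_timestamps", "1", pattern],
        check=True,
    )

    uploaded = []
    for name in sorted(os.listdir(out_dir)):
        # Each chunk upload raises its own Cloud Storage notification, which
        # fans the frame extraction out to a separate Cloud Run request.
        blob = bucket.blob(f"{prefix}/{name}")
        blob.upload_from_filename(os.path.join(out_dir, name))
        uploaded.append(blob.name)

    return uploaded

Each chunk then follows the same Cloud Storage notification, Pub/Sub, Cloud Run path as the original upload, but the per-request work stays well under the 10-minute push deadline.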


With the same idea of parallelization, you could also leverage the multi-CPU capacity of Cloud Run to process the video chunks in parallel within the same instance.
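A rough sketch of that in-instance variant, assuming the service is provisioned with several vCPUs and the chunks share a common prefix; process_chunk is a hypothetical stand-in for the per-chunk frame extraction (for example, reusing extract_frames above on a signed URL for the chunk):

import os
from concurrent.futures import ProcessPoolExecutor

from google.cloud import storage


def process_chunk(bucket_name: str, blob_name: str) -> int:
    # Hypothetical per-chunk worker: download or sign the chunk, then run
    # the same frame extraction as above. Returns the number of frames.
    return 0  # placeholder


def process_all_chunks(bucket_name: str, prefix: str) -> int:
    client = storage.Client()
    chunk_names = [b.name for b in client.list_blobs(bucket_name, prefix=prefix)]

    # Frame decoding/encoding is CPU-bound, so processes (not threads) are
    # what actually put the extra Cloud Run vCPUs to work.
    workers = os.cpu_count() or 1
    with ProcessPoolExecutor(max_workers=workers) as executor:
        counts = executor.map(
            process_chunk,
            [bucket_name] * len(chunk_names),
            chunk_names,
        )

    return sum(counts)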
