
How to use Google Cloud Functions / Tasks / PubSub for Batch Processing?

We are currently using RabbitMQ with Celery on some VMs for this:

  • We have a batch of tasks we want to process in parallel (e.g., process some files concurrently or run some machine learning inference on images)
  • When the batch is done, we call back to our App, get the results, and start some other batch of tasks which might depend on the results of the previously executed tasks (a simplified sketch of our current Celery setup follows this list)
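Roughly, what we do today looks like this (a simplified sketch; the task names, broker URL, and input paths are illustrative):

    from celery import Celery, chord

    app = Celery("batch", broker="amqp://rabbitmq:5672//")

    @app.task
    def process_file(path):
        # e.g., run some ML inference on one image
        return {"path": path, "ok": True}

    @app.task
    def on_batch_done(results):
        # call back into our App with the gathered results,
        # then schedule the next (dependent) batch from here
        ...

    # fan out one task per file, gather all results, then run the callback
    paths = ["images/a.png", "images/b.png"]  # placeholder inputs
    chord(process_file.s(p) for p in paths)(on_batch_done.s())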

So we have the following requirements:

  • Our App needs to know when a batch is done
  • Our App needs the gathered task results across the batch
  • Calling back to the App from every single task that succeeds might kill the App

Now we are trying to use Google Cloud for this, and we would like to move away from VMs to something like Google Cloud Tasks or Pub/Sub in combination with Google Cloud Functions. Is there any best-practice setup for our problem in Google Cloud?

I think you need an architect to re-design your solution for a lift into the cloud. It is a good time to check whether you want to move to managed products or would prefer to run the same stack in the cloud.

Talking about the products:

  • RabbitMQ should be replaced with Pub/Sub, which fits pretty well. (If you would like to keep using RabbitMQ, you can still run it yourself on Compute Engine or GKE.) Pub/Sub is the best choice if you want to move most of your solution to Google Cloud, and in the long term it can bring more benefits within the Google Cloud ecosystem (a minimal publish sketch follows this list).
  • Dataflow is a good batch processor. Here is an example of Pub/Sub with Dataflow: Quickstart: stream processing with Dataflow. There are Google-provided batch templates, or you can create your own: traditional or Flex.
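For illustration, a minimal sketch of enqueuing work on Pub/Sub with the google-cloud-pubsub client (the project and topic names are placeholders, and the topic is assumed to already exist):

    import json
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("your-project", "tasks")

    # roughly the Pub/Sub equivalent of Celery's task.delay(path)
    future = publisher.publish(
        topic_path,
        json.dumps({"path": "gs://bucket/image.png"}).encode("utf-8"),
    )
    print(future.result())  # blocks until the server returns the message ID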

Don't rush to pick a solution. It is well worth checking all your business and technical requirements and exploring the benefits of each Google Cloud product (managed or not). The more detailed your requirements are, the better you can design your solution.

Google Cloud today offers only one workflow manager, named Cloud Composer (based on the Apache Airflow project) (I don't take into account the AI Platform workflow manager, AI Pipelines). This managed solution allows you to do the same things you do today with Celery:

  • An event occurs
  • A Cloud Function is called to process the event
  • The Cloud Function triggers a DAG (Directed Acyclic Graph, a workflow in Airflow)
  • A step in the DAG runs a lot of sub-processes (Cloud Functions/Cloud Run/anything else), waits for them to finish, and continues to the next step... (a trigger sketch follows this list)
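As a hedged sketch of steps 2 and 3, a Cloud Function can start a DAG run through the Airflow 2 REST API (the Composer 2 access pattern); the web server URL and DAG id below are placeholders you would take from your own environment, and the function's service account needs access to Composer:

    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    AIRFLOW_WEB_URL = "https://<your-composer-web-server-url>"  # placeholder
    DAG_ID = "batch_pipeline"  # hypothetical DAG name

    def trigger_dag(event, context):
        """Cloud Function entry point: starts one DAG run per incoming event."""
        credentials, _ = google.auth.default(
            scopes=["https://www.googleapis.com/auth/cloud-platform"]
        )
        session = AuthorizedSession(credentials)
        response = session.post(
            f"{AIRFLOW_WEB_URL}/api/v1/dags/{DAG_ID}/dagRuns",
            json={"conf": {"event_id": context.event_id}},  # hand the event to the DAG
        )
        response.raise_for_status()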

2 warnings:

  • Composer is expensive (about $400 per month for the minimal config)
  • DAGs are acyclic: no loops are allowed

Note: a new workflow product should come to GCP. There is no ETA for now, and at the beginning parallelism won't be managed. IMO, this solution is the right one for you, but not in the short term; maybe in 12 months.

About the message queue itself, you can use Pub/Sub: very efficient and affordable.

Alternative

You can build your own system by following this process:

  • An event occurs
  • A Cloud Function is called to process the event
  • The Cloud Function creates as many Pub/Sub messages as batch tasks are required
  • For each message generated, you write an entry into Firestore with the initial event and the messageId
  • The generated messages are consumed (by Cloud Functions, Cloud Run, or anything else), and at the end of the process the Firestore entry is updated to say that the sub-process has completed
  • You plug a Cloud Function onto the Firestore onWrite event. The function checks whether all the sub-processes for an initial event have completed; if so, it goes to the next step... (a sketch of the whole pattern follows this list)
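A minimal sketch of this fan-out/fan-in, assuming Python background Cloud Functions, a Pub/Sub topic named batch-tasks, and a Firestore collection named batches (all names are illustrative, not prescribed):

    import base64
    import json

    from google.cloud import firestore, pubsub_v1

    PROJECT_ID = "your-project"  # placeholder
    TOPIC = "batch-tasks"        # hypothetical topic name

    publisher = pubsub_v1.PublisherClient()
    db = firestore.Client()

    def fan_out(event, context):
        """Steps 1-4: split the initial event into tasks, publish one
        Pub/Sub message per task, and record each messageId in Firestore."""
        tasks = json.loads(base64.b64decode(event["data"]))["tasks"]
        topic_path = publisher.topic_path(PROJECT_ID, TOPIC)
        batch = db.collection("batches").document(context.event_id)
        batch.set({"total": len(tasks), "done": 0})
        for task in tasks:
            message_id = publisher.publish(
                topic_path,
                json.dumps(task).encode("utf-8"),
                batch_id=context.event_id,  # attribute the worker reads back
            ).result()
            batch.collection("messages").document(message_id).set(
                {"task": task, "completed": False}
            )

    def worker(event, context):
        """Step 5: consume one message, do the real work, then mark the
        Firestore entry as completed (context.event_id is the messageId)."""
        task = json.loads(base64.b64decode(event["data"]))
        batch = db.collection("batches").document(event["attributes"]["batch_id"])
        # ... process the file / run the ML inference on `task` here ...
        batch.collection("messages").document(context.event_id).update(
            {"completed": True, "result": "ok"}
        )
        batch.update({"done": firestore.Increment(1)})  # atomic counter

    def check_batch(data, context):
        """Step 6: Firestore onWrite trigger on batches/{batchId}; once
        every sub-process has reported in, start the next step."""
        doc_path = context.resource.split("/documents/")[1]  # e.g. "batches/abc"
        snapshot = db.document(doc_path).get().to_dict()
        if snapshot and snapshot["done"] >= snapshot["total"]:
            results = [
                m.to_dict()
                for m in db.document(doc_path).collection("messages").stream()
            ]
            # call back the App with the gathered results / launch the next batch
            print(f"batch {doc_path} complete: {len(results)} results")

The atomic Increment avoids racy counting when many workers finish at the same time; the onWrite trigger here is assumed to be configured on the batches/{batchId} document path.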

We have implemented a similar workflow in my company. It's not easy to maintain and debug when a problem occurs; otherwise, it works great.
