
How to call a Dataflow job written in Go from Cloud Functions in GCP

My goal is to create a mechanism in which uploading a new file to Cloud Storage triggers a Cloud Function. Eventually, this Cloud Function will trigger a Cloud Dataflow job.
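
The trigger side of this is a background Cloud Function bound to the google.storage.object.finalize event. A minimal sketch in Python (the function name and the launching logic are placeholders):

# main.py - minimal sketch of a GCS-triggered background Cloud Function (Python).
# Deployed roughly like:
#   gcloud functions deploy on_new_file --runtime python37 \
#     --trigger-resource YOUR_BUCKET --trigger-event google.storage.object.finalize
def on_new_file(event, context):
    """Runs whenever a new object is finalized in the bucket."""
    bucket = event["bucket"]
    name = event["name"]
    print(f"New file gs://{bucket}/{name} - this is where the Dataflow job should be launched.")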

I have a restriction that the Cloud Dataflow job should be written in Go, and the Cloud Function should be written in Python.

The problem I am facing right now is that I cannot call a Cloud Dataflow job from a Cloud Function.

The problem with a Cloud Dataflow job written in Go is that there is no template-location variable defined in the Apache Beam Go SDK. That's why I cannot create Dataflow templates. And since there are no Dataflow templates, the only way I can call a Cloud Dataflow job from a Cloud Function is to write a Python function that calls a bash script, which in turn runs the Dataflow job.

The bash script looks like this:

go run wordcount.go \
--runner dataflow \
--input gs://dataflow-samples/shakespeare/kinglear.txt \
--output gs://${BUCKET?}/counts \
--project ${PROJECT?} \
--temp_location gs://${BUCKET?}/tmp/ \
--staging_location gs://${BUCKET?}/binaries/ \
--worker_harness_container_image=apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515

But the above mechanism cannot create a new Dataflow job, and it seems cumbersome.
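
For completeness, the Cloud Function side of this workaround is roughly the sketch below; the script file name and entry point are illustrative:

# main.py - sketch of the workaround: the Python Cloud Function shells out to the bash script above.
import subprocess

def trigger_dataflow(event, context):
    result = subprocess.run(["bash", "run_wordcount.sh"],
                            capture_output=True, text=True)
    print(result.stdout)
    print(result.stderr)
    # Likely why this fails to create a job: the Python Cloud Functions runtime
    # does not include a Go toolchain, so `go run` has nothing to compile with.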

Is there a better way to achieve my goal? And what am I doing wrong with the above mechanism?

the Cloud Function should be written in Python

The Cloud Dataflow Client SDK can only create Dataflow jobs from templates. Therefore this requirement cannot be achieved unless you create your own template.
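
For reference, launching a job from an existing template in Python is typically a short call against the Dataflow REST API via google-api-python-client; a sketch with placeholder project, bucket, and template values:

# Sketch: launching an existing Dataflow template from Python.
# This is precisely the path that is unavailable here, because the Go SDK cannot produce a template.
from googleapiclient.discovery import build

def launch_from_template(project_id, template_gcs_path, job_name, parameters):
    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().templates().launch(
        projectId=project_id,
        gcsPath=template_gcs_path,          # e.g. gs://YOUR_BUCKET/templates/my_template
        body={"jobName": job_name, "parameters": parameters},
    )
    return request.execute()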

I have a restriction that the Cloud Dataflow job should be written in Go

Since your Python objective cannot be achieved, your other option is to run your Go program in Cloud Functions. Cloud Functions for Go is in alpha. However, I know of no method to execute an Apache Beam (Dataflow) program in Cloud Functions. Keep in mind that an Apache Beam program begins execution locally and connects itself to a cluster running somewhere else (Dataflow, Spark, etc.) unless you select runner=DirectRunner.

You have chosen the least mature language for using Apache Beam. The order of maturity and features is Java (excellent), Python (good and getting better every day), Go (not ready yet for primetime).

If you want to run Apache Beam programs written in Go on Cloud Dataflow, then you will need to use a platform such as your local system, Google Compute Engine or Google App Engine Flex. I do not know if App Engine Standard can run Apache Beam in Go.

I found out that the Apache Beam Go SDK supports a worker_binary parameter, which is similar to template-location for Java Dataflow jobs. Using this option, I was able to kick off a Go Dataflow job from my Python Cloud Function.
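
A sketch of what that can look like from the Python Cloud Function, assuming the Go pipeline has been pre-compiled into two binaries bundled with the function (a launcher built for the function's environment and a worker binary cross-compiled for linux-amd64); all file names, environment variables, and paths below are illustrative:

# main.py - sketch: launch the pre-compiled Go pipeline from a Python Cloud Function.
# Assumes the deployment package bundles:
#   ./wordcount         - the pipeline launcher, compiled for the Cloud Functions environment
#   ./wordcount_worker  - the same pipeline cross-compiled with GOOS=linux GOARCH=amd64
import os
import subprocess

def trigger_dataflow(event, context):
    bucket = os.environ["BUCKET"]
    project = os.environ["PROJECT"]
    subprocess.run([
        "./wordcount",
        "--runner", "dataflow",
        "--project", project,
        "--input", f"gs://{event['bucket']}/{event['name']}",
        "--output", f"gs://{bucket}/counts",
        "--temp_location", f"gs://{bucket}/tmp/",
        "--staging_location", f"gs://{bucket}/binaries/",
        "--worker_binary", "./wordcount_worker",   # the parameter mentioned above
    ], check=True)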
