How to trigger a Dataflow job with a Cloud Function? (Python SDK)
I have a Cloud Function that is triggered by Cloud Pub/Sub. I want the same function to trigger a Dataflow job using the Python SDK. Here is my code:
import base64

def hello_pubsub(event, context):
    if 'data' in event:
        message = base64.b64decode(event['data']).decode('utf-8')
    else:
        message = 'hello world!'
    print('Message of pubsub: {}'.format(message))
I deploy the function this way:

gcloud beta functions deploy hello_pubsub --runtime python37 --trigger-topic topic1
You can use Cloud Dataflow templates to launch your job. You will need to code the following steps:

- Retrieve the default credentials
- Build the Dataflow API client
- Call the templates launch endpoint with your template path, parameters and environment
Here is an example using your base code (feel free to split it into multiple methods to reduce the code inside the hello_pubsub method).
from googleapiclient.discovery import build
import base64
import google.auth
import os

def hello_pubsub(event, context):
    if 'data' in event:
        message = base64.b64decode(event['data']).decode('utf-8')
    else:
        message = 'hello world!'

    credentials, _ = google.auth.default()
    service = build('dataflow', 'v1b3', credentials=credentials)
    gcp_project = os.environ["GCLOUD_PROJECT"]
    template_path = "gs://template_file_path_on_storage/"
    template_body = {
        "parameters": {
            "keyA": "valueA",
            "keyB": "valueB",
        },
        "environment": {
            "envVariable": "value"
        }
    }
    request = service.projects().templates().launch(projectId=gcp_project, gcsPath=template_path, body=template_body)
    response = request.execute()
    print(response)
In the template_body variable, the parameters values are the arguments that will be sent to your pipeline, and the environment values are used by the Dataflow service itself (service account, workers and network configuration).

See the LaunchTemplateParameters documentation.
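To make that split concrete, here is a sketch of a template_body using some common RuntimeEnvironment fields from the Dataflow v1b3 API; the bucket, topic and service-account values are placeholders, not real resources:

```python
# Sketch of a launch body: "parameters" go to your pipeline,
# "environment" is consumed by the Dataflow service.
# All resource names below are illustrative placeholders.
template_body = {
    "jobName": "pubsub-triggered-job",
    "parameters": {
        # Pipeline-specific arguments, forwarded to the template
        "inputTopic": "projects/your-project/topics/topic1",
    },
    "environment": {
        # Settings read by the Dataflow service itself
        "tempLocation": "gs://your-bucket/tmp",
        "serviceAccountEmail": "dataflow-runner@your-project.iam.gserviceaccount.com",
        "machineType": "n1-standard-1",
        "maxWorkers": 3,
    },
}

print(sorted(template_body.keys()))
```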
You have to embed your pipeline Python code with your function. When your function is called, you simply call the pipeline's main function, which executes the pipeline in your file.

If you developed and tested your pipeline in Cloud Shell and already ran it on the Dataflow service, your code should have this structure:
def run(argv=None, save_main_session=True):
    # Parse arguments
    # Set options
    # Start the Pipeline in the p variable
    # Perform your transforms in the Pipeline
    # Run your Pipeline
    result = p.run()
    # Wait for the end of the pipeline
    result.wait_until_finish()
Thus, call this function with the correct arguments, especially runner=DataflowRunner, to allow the Python code to load the pipeline in the Dataflow service.
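A minimal, Beam-free sketch of that call pattern, assuming run() parses its own argv (the option names below are illustrative): the unrecognized options such as --runner are exactly the ones a real pipeline would forward to Beam's PipelineOptions.

```python
import argparse

def run(argv=None):
    # Parse pipeline-specific arguments; everything argparse does not
    # recognize (e.g. --runner) would be forwarded to Beam's
    # PipelineOptions in a real pipeline.
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_topic')
    known_args, pipeline_args = parser.parse_known_args(argv)
    return known_args, pipeline_args

# Inside the Cloud Function, call run() with runner=DataflowRunner so
# the job is submitted to the Dataflow service, not executed locally.
known, forwarded = run([
    '--input_topic', 'projects/your-project/topics/topic1',
    '--runner=DataflowRunner',
    '--project=your-project',
    '--temp_location=gs://your-bucket/tmp',
])
print(forwarded)
```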
At the end, delete the result.wait_until_finish() call, because your function won't live as long as the whole Dataflow process.
You can also use a template if you want.
A solution that worked for me is subprocessing: in my Cloud Function I run, as a subprocess, the shell command that executes the file holding the pipeline.

PS: my Dataflow job reads from a subscription sub1 and writes into a new topic topic2.
import subprocess

subprocess.run([
    "python", "./file.py",
    "--input_topic", "projects/your-project/subscriptions/sub1",
    "--output_topic", "projects/your-project/topics/topic2",
])
It runs your pipeline written in ./file.py.
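Put together, the Cloud Function wrapper might look like the sketch below (the file name and topic paths are illustrative). Passing each argument as its own list element avoids any shell quoting issues:

```python
import base64
import subprocess

def hello_pubsub(event, context):
    # Decode the Pub/Sub message, as in the original function.
    if 'data' in event:
        message = base64.b64decode(event['data']).decode('utf-8')
    else:
        message = 'hello world!'
    print('Message of pubsub: {}'.format(message))

    # Launch the pipeline file as a child process; each argument is a
    # separate list element, so no shell parsing is involved.
    completed = subprocess.run([
        "python", "./file.py",
        "--input_topic", "projects/your-project/subscriptions/sub1",
        "--output_topic", "projects/your-project/topics/topic2",
    ])
    print('Launcher exited with code {}'.format(completed.returncode))
```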