
How to trigger a dataflow with a cloud function? (Python SDK)

I have a cloud function that is triggered by Cloud Pub/Sub. I want the same function to trigger a Dataflow job using the Python SDK. Here is my code:

import base64
def hello_pubsub(event, context):   
    if 'data' in event:
        message = base64.b64decode(event['data']).decode('utf-8')
    else:
        message = 'hello world!'
    print('Message of pubsub : {}'.format(message))

I deploy the function this way:

gcloud beta functions deploy hello_pubsub  --runtime python37 --trigger-topic topic1

You can use Cloud Dataflow templates to launch your job. You will need to code the following steps:

  • Retrieve credentials
  • Generate Dataflow service instance
  • Get GCP PROJECT_ID
  • Generate template body
  • Execute template

Here is an example using your base code (feel free to split it into multiple methods to reduce the code inside the hello_pubsub method).

from googleapiclient.discovery import build
import base64
import google.auth
import os

def hello_pubsub(event, context):   
    if 'data' in event:
        message = base64.b64decode(event['data']).decode('utf-8')
    else:
        message = 'hello world!'

    # Retrieve credentials and build the Dataflow service client
    credentials, _ = google.auth.default()
    service = build('dataflow', 'v1b3', credentials=credentials)

    # Get the GCP PROJECT_ID from the function's environment
    gcp_project = os.environ["GCLOUD_PROJECT"]

    # Template file path on Cloud Storage and template body
    template_path = "gs://template_file_path_on_storage/"
    template_body = {
        "parameters": {
            "keyA": "valueA",
            "keyB": "valueB",
        },
        "environment": {
            "envVariable": "value"
        }
    }

    # Execute the template launch request
    request = service.projects().templates().launch(
        projectId=gcp_project, gcsPath=template_path, body=template_body)
    response = request.execute()

    print(response)

In the template_body variable, the parameters values are the arguments that will be sent to your pipeline, and the environment values are used by the Dataflow service (service account, workers and network configuration).

LaunchTemplateParameters documentation

RuntimeEnvironment documentation
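
As an illustration, here is a hedged sketch of what a more concrete template_body could look like. Every value below (job name, parameter names, bucket, zone, service account) is a placeholder assumption, and the parameter names must match whatever your template actually declares; the environment keys are among those documented in RuntimeEnvironment.

    # Sketch only: all values are placeholders, not values from the question.
    template_body = {
        "jobName": "pubsub-triggered-job",  # hypothetical job name
        "parameters": {
            "inputSubscription": "projects/your-project/subscriptions/sub1",  # hypothetical
            "outputTopic": "projects/your-project/topics/topic2",             # hypothetical
        },
        "environment": {
            "tempLocation": "gs://your-bucket/tmp",  # hypothetical bucket
            "zone": "europe-west1-b",                # hypothetical zone
            "maxWorkers": 2,
            "serviceAccountEmail": "dataflow-sa@your-project.iam.gserviceaccount.com",  # hypothetical
        },
    }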

You have to embed your pipeline Python code with your function. When your function is called, you simply call the pipeline's main Python function, which executes the pipeline in your file.

If you developed and tested your pipeline in Cloud Shell and you have already run it as a Dataflow pipeline, your code should have this structure:

def run(argv=None, save_main_session=True):
  # Parse arguments
  # Set pipeline options
  # Create the Pipeline as p
  # Apply your transforms to the Pipeline
  # Run the Pipeline
  result = p.run()
  # Wait for the pipeline to finish
  result.wait_until_finish()

Thus, call this function with the correct arguments, especially runner=DataflowRunner, to allow the Python code to load the pipeline into the Dataflow service.

At the end, delete the result.wait_until_finish(), because your function won't live as long as the whole Dataflow process. A minimal sketch of such a call follows.
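
For example, assuming the pipeline file is importable as pipeline_module and using placeholder project, region and bucket values (all of these names are assumptions, not part of the original answer):

    import pipeline_module  # hypothetical module name for the file that defines run()

    def hello_pubsub(event, context):
        # Pass Dataflow-specific options as command-line style arguments;
        # every value below is a placeholder.
        pipeline_module.run(argv=[
            '--runner=DataflowRunner',
            '--project=your-project-id',
            '--region=europe-west1',
            '--temp_location=gs://your-bucket/tmp',
        ])

With wait_until_finish() removed, the function returns as soon as the job has been submitted to the Dataflow service.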

You can also use a template if you want.

A solution that worked for me is to use a subprocess: in my cloud function, I run the shell command that executes the file holding the pipeline:
PS: My Dataflow job reads from a subscription sub1 and writes into a new topic topic2.

subprocess.run(["python", "./file.py --input_topic 'projects/your-project/subscriptions/sub1' --output_topic 'projects/your-project/topics/topic2'"] ) subprocess.run(["python", "./file.py --input_topic 'projects/your-project/subscriptions/sub1' --output_topic 'projects/your-project/topics/topic2'"] )

It runs your pipeline written in ./file.py.
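
For reference, a minimal sketch of what ./file.py could look like under those assumptions: a streaming pipeline that copies messages from the subscription to the topic. This is illustrative only, not the answerer's actual code.

    import argparse
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def run(argv=None):
        parser = argparse.ArgumentParser()
        # --input_topic actually receives a subscription path in the command above
        parser.add_argument('--input_topic', required=True)
        parser.add_argument('--output_topic', required=True)
        known_args, pipeline_args = parser.parse_known_args(argv)

        # Remaining arguments (runner, project, temp_location, ...) go to Beam
        options = PipelineOptions(pipeline_args, streaming=True)
        with beam.Pipeline(options=options) as p:
            (p
             | 'Read from Pub/Sub' >> beam.io.ReadFromPubSub(subscription=known_args.input_topic)
             | 'Write to Pub/Sub' >> beam.io.WriteToPubSub(known_args.output_topic))

    if __name__ == '__main__':
        run()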
