
Dataflow Job to start based on PubSub Notification - Python

I am writing a Dataflow job which reads from BigQuery and does a few transformations.

import apache_beam as beam

# `pipeline` is an existing beam.Pipeline instance
data = (
    pipeline
    | beam.io.ReadFromBigQuery(query='''
    SELECT * FROM `bigquery-public-data.chicago_crime.crime` LIMIT 100
    ''', use_standard_sql=True)
    | beam.Map(print)
)

But my requirement is to read from BigQuery only after receiving a notification from a PubSub topic. The above Dataflow job should start reading data from BigQuery only if the below message is received. If it is a different job ID or a different status, then no action should be taken.

PubSub Message : {'job_id':101, 'status': 'Success'}

Any help on this part?

That is fairly easy, the code would look like this:

import json

pubsub_msg = (
    pipeline
    # pass either a topic or a subscription to ReadFromPubSub, not both
    | beam.io.gcp.pubsub.ReadFromPubSub(subscription=my_subscription)
    | beam.Map(lambda raw: json.loads(raw.decode('utf-8')))  # the payload arrives as bytes
)

bigquery_data = (
    pubsub_msg
    | beam.Filter(lambda msg: msg['job_id'] == 101 and msg['status'] == 'Success')   # you might want a more sophisticated filter condition
    | beam.io.ReadFromBigQuery(query='''
    SELECT * FROM `bigquery-public-data.chicago_crime.crime` LIMIT 100
    ''', use_standard_sql=True)
)
bigquery_data | beam.Map(print)

However, if you do it like that you will have a streaming Dataflow job running (indefinitely, or until you cancel the job), since using ReadFromPubSub automatically results in a streaming job.

If you require a batch job, I would recommend using a Dataflow template and starting this template with a Cloud Function which listens to your PubSub topic. The filtering logic would then live inside this Cloud Function (as a simple if condition), as shown in the sketch below.
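
For illustration, a Pub/Sub-triggered Cloud Function along these lines could do the filtering and launch a classic Dataflow template via the Dataflow REST API. This is a rough sketch, not part of the original answer; the project, region, staged template path and job name are placeholder values you would replace with your own:

import base64
import json

from googleapiclient.discovery import build

PROJECT = 'my-project'        # placeholder
REGION = 'us-central1'        # placeholder
TEMPLATE_PATH = 'gs://my-bucket/templates/bq_read_template'   # placeholder GCS path of the staged template

def trigger_dataflow(event, context):
    """Background Cloud Function triggered by a message on the PubSub topic."""
    msg = json.loads(base64.b64decode(event['data']).decode('utf-8'))

    # Only react to the expected notification; for any other job_id or status do nothing.
    if msg.get('job_id') != 101 or msg.get('status') != 'Success':
        return

    # Launch the pre-staged batch Dataflow template.
    dataflow = build('dataflow', 'v1b3')
    dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE_PATH,
        body={'jobName': 'bq-read-job-{}'.format(msg['job_id'])},
    ).execute()

Deployed with something like gcloud functions deploy trigger_dataflow --runtime python39 --trigger-topic my_topic, this starts a fresh batch job only when the matching message arrives.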

I ended up using Cloud Functions, added the filtering logic in it and started the Dataflow job from there. Found the below link useful: How to trigger a dataflow with a cloud function? (Python SDK)
