Dataflow Job to start based on PubSub Notification - Python
I am writing a Dataflow job which reads from BigQuery and does a few transformations.
data = (
    pipeline
    | beam.io.ReadFromBigQuery(query='''
          SELECT * FROM `bigquery-public-data.chicago_crime.crime` LIMIT 100
      ''', use_standard_sql=True)
    | beam.Map(print)
)
But my requirement is to read from BigQuery only after receiving a notification from a PubSub topic. The above Dataflow job should start reading data from BigQuery only if the below message is received. If it is a different job id or a different status, then no action should be taken.
PubSub message: {'job_id': 101, 'status': 'Success'}
Any help on this part?
That is fairly easy; the code would look like this:
import json

pubsub_msg = (
    pipeline
    # pass either topic or subscription, not both
    | beam.io.gcp.pubsub.ReadFromPubSub(subscription=my_subscription)
    # Pub/Sub delivers raw bytes; decode the JSON payload first
    | beam.Map(lambda raw: json.loads(raw.decode('utf-8')))
)

bigquery_data = (
    pubsub_msg
    | beam.Filter(lambda msg: msg['job_id'] == 101)  # you might want a more sophisticated filter condition
    # ReadFromBigQuery is a root transform and cannot be applied to another
    # PCollection; ReadAllFromBigQuery takes read requests as input instead
    | beam.Map(lambda msg: beam.io.ReadFromBigQueryRequest(query='''
          SELECT * FROM `bigquery-public-data.chicago_crime.crime` LIMIT 100
      ''', use_standard_sql=True))
    | beam.io.ReadAllFromBigQuery()
)

bigquery_data | beam.Map(print)
However, if you do it like that you will have a streaming Dataflow job running (indefinitely, or until you cancel the job), since using ReadFromPubSub automatically results in a streaming job.
If you require a batch job, I would recommend using a Dataflow template, and starting this template with a Cloud Function which listens to your PubSub topic. The filtering logic would then live in that Cloud Function (as a simple if condition).
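A minimal sketch of such a Cloud Function, assuming a classic Dataflow template already staged in GCS and using the Dataflow v1b3 API via google-api-python-client; the project, region, template path, and job name are placeholders:

```python
# Hypothetical Pub/Sub-triggered Cloud Function that launches a Dataflow
# template only for the expected job id and status.
import base64
import json

EXPECTED_JOB_ID = 101
EXPECTED_STATUS = "Success"


def should_start(message: dict) -> bool:
    """Filter condition: react only to the expected job id and status."""
    return (message.get("job_id") == EXPECTED_JOB_ID
            and message.get("status") == EXPECTED_STATUS)


def trigger_dataflow(event, context):
    """Entry point for a Pub/Sub-triggered Cloud Function."""
    # Pub/Sub delivers the payload base64-encoded in event["data"].
    message = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    if not should_start(message):
        return  # different job id or status: do nothing

    # Launch the pre-staged Dataflow template (placeholder names below).
    from googleapiclient.discovery import build  # google-api-python-client
    dataflow = build("dataflow", "v1b3")
    dataflow.projects().locations().templates().launch(
        projectId="my-project",
        location="us-central1",
        gcsPath="gs://my-bucket/templates/my-template",
        body={"jobName": "bq-batch-read", "parameters": {}},
    ).execute()
```

Since the filter lives in plain Python here, it is easy to unit-test `should_start` in isolation, and the BigQuery pipeline itself stays a simple batch job.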
I ended up using Cloud Functions, added the filtering logic in it and started the Dataflow job from there. Found the link below useful:
How to trigger a dataflow with a cloud function? (Python SDK)