
Dataflow Job to start based on PubSub Notification - Python

I am writing a Dataflow job which reads from BigQuery and does a few transformations.

import apache_beam as beam

# `pipeline` is an existing beam.Pipeline instance
data = (
    pipeline
    | beam.io.ReadFromBigQuery(query='''
    SELECT * FROM `bigquery-public-data.chicago_crime.crime` LIMIT 100
    ''', use_standard_sql=True)
    | beam.Map(print)
)

But my requirement is to read from BigQuery only after receiving a notification from a PubSub topic. The above Dataflow job should start reading data from BigQuery only if the below message is received. If it is a different job ID or a different status, then no action should be taken.

PubSub Message : {'job_id':101, 'status': 'Success'}

Any help on this part?

That is fairly easy, the code would look like this:

import json

pubsub_msg = (
    pipeline
    # pass either a topic or a subscription to ReadFromPubSub, not both
    | beam.io.gcp.pubsub.ReadFromPubSub(subscription=my_subscription)
    | beam.Map(lambda raw: json.loads(raw.decode('utf-8')))  # the payload arrives as bytes
)

bigquery_data = (
    pubsub_msg
    | beam.Filter(lambda msg: msg['job_id'] == 101 and msg['status'] == 'Success')   # you might want a more sophisticated filter condition
    | beam.io.ReadFromBigQuery(query='''
    SELECT * FROM `bigquery-public-data.chicago_crime.crime` LIMIT 100
    ''', use_standard_sql=True)
)
bigquery_data | beam.Map(print)

However, if you do it like that you will have a streaming Dataflow job running (indefinitely, or until you cancel the job), since using ReadFromPubSub automatically results in a streaming job.

If you require a batch job, I would recommend using a Dataflow template and starting this template with a Cloud Function which listens to your PubSub topic. The filtering logic would then live inside this Cloud Function (as a simple if condition), as shown in the sketch below.
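
For illustration, a Pub/Sub-triggered Cloud Function along these lines could do the filtering and launch a classic Dataflow template via the Dataflow REST API. This is a rough sketch, not part of the original answer; the project, region, staged template path and job name are placeholder values you would replace with your own:

import base64
import json

from googleapiclient.discovery import build

PROJECT = 'my-project'        # placeholder
REGION = 'us-central1'        # placeholder
TEMPLATE_PATH = 'gs://my-bucket/templates/bq_read_template'   # placeholder GCS path of the staged template

def trigger_dataflow(event, context):
    """Background Cloud Function triggered by a message on the PubSub topic."""
    msg = json.loads(base64.b64decode(event['data']).decode('utf-8'))

    # Only react to the expected notification; for any other job_id or status do nothing.
    if msg.get('job_id') != 101 or msg.get('status') != 'Success':
        return

    # Launch the pre-staged batch Dataflow template.
    dataflow = build('dataflow', 'v1b3')
    dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE_PATH,
        body={'jobName': 'bq-read-job-{}'.format(msg['job_id'])},
    ).execute()

Deployed with something like gcloud functions deploy trigger_dataflow --runtime python39 --trigger-topic my_topic, this starts a fresh batch job only when the matching message arrives.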

I ended up using Cloud Functions, added the filtering logic in it and started the Dataflow job from there. Found the below link useful: How to trigger a dataflow with a cloud function? (Python SDK)
