Side Input data doesn't get updated - Python Apache Beam
I'm building a pipeline with dynamic configuration data that gets updated whenever a change is triggered.
There are 2 PubSub topics: topic A carries the IoT data, and topic B carries the configuration that is used to transform the IoT data.
The configuration is kept in Cloud Firestore. When the database gets updated, a Cloud Function reads the updated configuration and publishes it to PubSub topic B.
The problem is that the Dataflow job only reads the configuration data once, at the start of the job, and it is never updated afterwards.
How can I make the side input get updated?
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub, WriteToPubSub
from apache_beam.transforms import trigger, window

p = beam.Pipeline(options=options)

class Transform(beam.DoFn):
    def process(self, element, configuration):
        ...
        yield output

def run():
    ...
    iot_data = (p
        | 'ReadIotData' >> ReadFromPubSub(TOPIC_A))

    configuration = (p
        | 'ReadConfig' >> ReadFromPubSub(TOPIC_B)
        | 'WindowUserData' >> beam.WindowInto(
            window.GlobalWindows(),
            trigger=trigger.Repeatedly(trigger.AfterCount(1)),
            accumulation_mode=trigger.AccumulationMode.DISCARDING)
        | 'JsonLoadsUserData' >> beam.Map(lambda x: ('data', x.decode().replace('\\', ''))))

    output = (iot_data
        | 'transform' >> beam.ParDo(Transform(),
                                    beam.pvalue.AsDict(configuration))
        | 'Output' >> WriteToPubSub(TOPIC_C))
I would try running this pipeline with https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#dataflow-runner-v2 , which has better support for side inputs.
I ran into a similar problem recently: it didn't work with the DirectRunner, but worked fine with the Dataflow runner.
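Another option worth considering is Beam's documented "slowly updating side input" pattern: instead of pushing config changes over Pub/Sub, poll the source on a timer with PeriodicImpulse and window the side input to match the main input. A minimal sketch, reusing `Transform`, `options`, `TOPIC_A` and `TOPIC_C` from the question; `read_config` is a placeholder for a direct Firestore read:

```python
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub, WriteToPubSub
from apache_beam.transforms import window
from apache_beam.transforms.periodicsequence import PeriodicImpulse

def read_config(_ts):
    # Placeholder: fetch the latest configuration directly, e.g. from Firestore.
    return ('data', '{"threshold": 10}')

p = beam.Pipeline(options=options)

# Fire every 60 s; apply_windowing=True puts each tick (and hence each
# refreshed config) into its own fixed window of the same length.
configuration = (p
    | 'PeriodicTick' >> PeriodicImpulse(fire_interval=60, apply_windowing=True)
    | 'FetchConfig' >> beam.Map(read_config))

# Window the main input into matching fixed windows so each window of IoT
# data is paired with the config fetched for that window.
iot_data = (p
    | 'ReadIotData' >> ReadFromPubSub(TOPIC_A)
    | 'WindowIotData' >> beam.WindowInto(window.FixedWindows(60)))

output = (iot_data
    | 'transform' >> beam.ParDo(Transform(), beam.pvalue.AsDict(configuration))
    | 'Output' >> WriteToPubSub(TOPIC_C))
```

The trade-off is a polling delay of up to one fire interval before a config change takes effect, but it avoids relying on the Pub/Sub-driven side input being refreshed at all.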