
Side Input data doesn't get updated - Python Apache Beam

I'm building a pipeline with dynamic configuration data that should be updated whenever it is triggered.

There are two PubSub topics: topic A carries the IoT data, and topic B carries the configuration used to transform the IoT data.

The configuration is kept in Cloud Firestore. When the database is updated, a Cloud Function reads the updated configuration and publishes it to PubSub topic B.

The problem is that the Dataflow job only reads the configuration data once, at job start, and never picks up updates.

How can I make the side input pick up the updated configuration?

import apache_beam as beam
from apache_beam.io import ReadFromPubSub, WriteToPubSub
from apache_beam.transforms import trigger, window

p = beam.Pipeline(options=options)

class Transform(beam.DoFn):
    # The side input is passed as an extra argument after the main element.
    def process(self, element, configuration):
        ...
        yield output

def run():
    ...
    iot_data = (p
        | 'ReadIotData' >> ReadFromPubSub(TOPIC_A))

    configuration = (p
        | 'ReadConfig' >> ReadFromPubSub(TOPIC_B)
        | 'WindowUserData' >> beam.WindowInto(
            window.GlobalWindows(),
            trigger=trigger.Repeatedly(trigger.AfterCount(1)),
            accumulation_mode=trigger.AccumulationMode.DISCARDING)
        | 'JsonLoadsUserData' >> beam.Map(
            lambda x: ('data', x.decode().replace('\\', ''))))

    output = (iot_data
        | 'transform' >> beam.ParDo(Transform(),
            beam.pvalue.AsDict(configuration))
        | 'Output' >> WriteToPubSub(TOPIC_C))
I would try running this pipeline with Dataflow Runner v2 (https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#dataflow-runner-v2), which has better support for side inputs.

I hit a similar problem recently; it didn't work on the DirectRunner, but worked fine on the Dataflow runner.
