How to handle lack of messages from Pub/Sub in Google Dataflow

Below is some arbitrary example code:

import argparse
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode

class FormatDoFn(beam.DoFn):
    def process(self, viewing, window=beam.DoFn.WindowParam):
        print(viewing)

def main(argv=None):

    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)

    with beam.Pipeline(argv=pipeline_args) as p:

        # Read from PubSub messages.
        input = p | beam.io.ReadFromPubSub("projects/example-project/topics/example")

        transformed = (
            input
            | beam.WindowInto(
                                window.FixedWindows(5),
                                trigger = AfterWatermark(),
                                accumulation_mode = AccumulationMode.DISCARDING
                            )

            | 'Format' >> beam.ParDo(FormatDoFn())
        )

if __name__ == '__main__':
    main()

I have a working Dataflow job that processes Pub/Sub messages per fixed window interval and inserts the aggregated results into a table. However, it only makes inserts when messages are received.

How can I make the Dataflow job insert a zero-value row once a window has passed without any messages being received?

The solution turned out to be very simple for my use case. I just generated dummy KVs and combined them with the KVs produced from the regular Pub/Sub messages.

PCollection<Long> input = pipeline
                            .apply("unbounded longs",
                                    GenerateSequence
                                            .from(0)
                                            .withRate(2, Duration.standardSeconds(1))
                            );
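To make the idea concrete, here is a minimal plain-Python sketch of the logic, not Beam code: real messages become (window, 1) pairs, a generated sequence supplies a (window, 0) pair for every window, and summing per window yields a zero row wherever no message arrived. The names `window_start` and `count_per_window` and the sample timestamps are made up for illustration; in a real pipeline the merge would presumably be a flatten of the two collections followed by a per-key sum.

```python
from collections import defaultdict

WINDOW_SECONDS = 5  # same size as FixedWindows(5) in the question


def window_start(timestamp, size=WINDOW_SECONDS):
    """Return the start of the fixed window that contains `timestamp`."""
    return int(timestamp // size) * size


def count_per_window(message_timestamps, start, end, size=WINDOW_SECONDS):
    """Count messages per window, emitting 0 for windows with no messages."""
    # Real messages contribute 1 to their window.
    real = [(window_start(ts, size), 1) for ts in message_timestamps]
    # Dummy elements contribute 0 to every window in [start, end).
    dummy = [(w, 0) for w in range(window_start(start, size), end, size)]

    totals = defaultdict(int)
    for window, value in real + dummy:  # merge both sources, then sum per key
        totals[window] += value
    return sorted(totals.items())


# Hypothetical message timestamps (seconds); windows 5 and 15 receive nothing.
rows = count_per_window([1.0, 2.5, 12.0], start=0, end=20)
print(rows)
```

Because the dummy elements carry a value of zero, they never change the count of a window that did receive messages; they only guarantee that every window has at least one element to aggregate.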

As mentioned by @CaptainNabla and @Bruno Volpato, you can use timers, which allow delayed processing. You can keep the count for every half minute in state and use a timer to emit the results periodically and drop the unnecessary counts. The timer can be set or unset. For more information, read about timely and stateful processing in Apache Beam.
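The state-plus-timer approach can be sketched in plain Python as a rough simulation, assuming a 30-second period. This is not the Beam state and timer API: `LoopingCounter`, `advance_to`, and `on_message` are hypothetical names, with an integer standing in for the per-key state and `next_fire` standing in for a timer that re-arms itself each time it fires.

```python
class LoopingCounter:
    """Plain-Python stand-in for a stateful step with a looping timer."""

    def __init__(self, period, start=0):
        self.period = period
        self.count = 0                    # plays the role of per-key state
        self.next_fire = start + period   # plays the role of the timer

    def advance_to(self, now):
        """Fire every timer due at or before `now`; return the emitted rows."""
        rows = []
        while self.next_fire <= now:
            rows.append((self.next_fire, self.count))  # may be a zero row
            self.count = 0                  # clear the state after emitting
            self.next_fire += self.period   # re-arm the timer for the next period
        return rows

    def on_message(self, timestamp):
        """Fire any overdue timers, then count the incoming message."""
        rows = self.advance_to(timestamp)
        self.count += 1
        return rows


counter = LoopingCounter(period=30)
emitted = []
for ts in [5, 10, 95]:          # hypothetical message arrival times (seconds)
    emitted += counter.on_message(ts)
emitted += counter.advance_to(120)  # time passes with no further messages
print(emitted)
```

The timer fires whether or not messages arrived, so the periods ending at 60 and 90 still produce rows with a count of zero. In this sketch a message that arrives exactly on a period boundary is counted in the following period.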
