
FixedWindow apparently not firing in streaming python beam pipeline

I am trying to build a Dataflow pipeline in Python. The main input stream comes from Pub/Sub, and the main processing function takes a side input that is updated fairly irregularly from a second Pub/Sub stream. I have written the following code to test my design:

# Imports assumed by the snippet below:
from apache_beam import Map, Pipeline, WindowInto, io, pvalue
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger, window

def print_and_return(x):
    print('---DEBUG---: ' + str(x))
    return x

def comb(x, y):
    return f'Main input: {x}, side input: {y}'

def load_side_input(pubsub_message):
    import json
    message = pubsub_message.decode("utf8")
    side_input = json.loads(message)
    return ('side', side_input)


def run(input_subscription, side_input_sub, pipeline_args=None):
    pipeline_options = PipelineOptions(
        pipeline_args, streaming=True, save_main_session=True
    )

    with Pipeline(options=pipeline_options) as pipeline:
        side_input = (
            pipeline
            | "Side impulse" >> io.ReadFromPubSub(subscription=side_input_sub)
            | "Window side" >> WindowInto(window.GlobalWindows(), trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                                          accumulation_mode=trigger.AccumulationMode.DISCARDING)
            | "Parse side input" >> Map(load_side_input)
        )
        (
            pipeline
            | "Read from Pub/Sub" >> io.ReadFromPubSub(subscription=input_subscription, with_attributes=True)
            | "Window" >> WindowInto(window.FixedWindows(10))
            | "Add sideinput" >> Map(comb, y=pvalue.AsDict(side_input))
            | "Print" >> Map(print_and_return)
        )

I run it locally in debug mode to test. The load_side_input function fires (I know because a breakpoint placed inside it gets hit), but the rest ( comb and print_and_return ) don't. My understanding is that the FixedWindows should trigger every 10 seconds on the main input, and Beam would match that window with the last firing of the trigger on the side input since it's in a global window, but in fact nothing happens.

What am I missing, and why is there no output?

EDIT:

After days of trying things out and even asking on the Beam users mailing list, I had the idea that it might just be the local runner acting up, and sure enough, after deploying to Dataflow the pipeline works as expected. It's annoying to test this way, though, so it would still be nice to know what the problem is and how to solve it.

It's not triggering because the windows of the main and side inputs don't match (fixed vs. global), so the main input cannot fetch the side input data.

Since you are not using any aggregation, you can simply take the main input's windowing out. If you plan to aggregate, re-window after the join with the side input.

There is an example here:

Apache Beam Cloud Dataflow Streaming Stuck Side Input
