FixedWindow apparently not firing in streaming Python Beam pipeline
I am trying to build a Dataflow pipeline in Python. The main input stream comes from Pub/Sub, and the main processing function takes a side input that is updated, fairly irregularly, from a second Pub/Sub stream.

I have written the following code to test my design:
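
    from apache_beam import Map, Pipeline, WindowInto, io, pvalue
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms import trigger, window


    def print_and_return(x):
        # Debug helper: print each element and pass it through unchanged.
        print('---DEBUG---: ' + str(x))
        return x


    def comb(x, y):
        # x is a main-input element, y is the side input materialized as a dict.
        return f'Main input: {x}, side input: {y}'


    def load_side_input(pubsub_message):
        import json
        message = pubsub_message.decode("utf8")
        side_input = json.loads(message)
        # Return a (key, value) pair so the PCollection can be used with AsDict.
        return ('side', side_input)


    def run(input_subscription, side_input_sub, pipeline_args=None):
        pipeline_options = PipelineOptions(
            pipeline_args, streaming=True, save_main_session=True
        )
        with Pipeline(options=pipeline_options) as pipeline:
            # Side input: global window that fires on every new element.
            side_input = (
                pipeline
                | "Side impulse" >> io.ReadFromPubSub(subscription=side_input_sub)
                | "Window side" >> WindowInto(
                    window.GlobalWindows(),
                    trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                    accumulation_mode=trigger.AccumulationMode.DISCARDING)
                | "Parse side input" >> Map(load_side_input)
            )
            # Main input: 10-second fixed windows, joined with the side input.
            (
                pipeline
                | "Read from Pub/Sub" >> io.ReadFromPubSub(
                    subscription=input_subscription, with_attributes=True)
                | "Window" >> WindowInto(window.FixedWindows(10))
                | "Add sideinput" >> Map(comb, y=pvalue.AsDict(side_input))
                | "Print" >> Map(print_and_return)
            )
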
I run it locally in debug mode. The load_side_input function fires (I know because a breakpoint placed inside it gets hit), but the rest (comb and print_and_return) does not. My understanding is that the FixedWindows window should fire every 10 seconds on the main input, and that Beam would match that window with the most recent firing of the trigger on the side input, since the side input is in a global window. In fact, nothing happens.

What am I missing? Why is there no output?

EDIT:
After days of trying things out, and even asking on the Beam users mailing list, I had the idea that it might just be the local runner acting up. Sure enough, after deploying to Dataflow the pipeline works as expected. Testing this way is annoying, though, so it would still be nice to know what the problem is and how to solve it.

It's not triggering because the windows of the main input and the side input don't match (fixed vs. global), so the main input cannot fetch the side input data.

Since you are not using any aggregation, you can simply remove the window on the main input. If you plan to aggregate, re-window after the side input has been joined.

There is an example in this related question: Apache Beam Cloud Dataflow Streaming Stuck Side Input
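
As a rough sketch of the suggestion above, the main input can be left unwindowed until after the side input is attached, and only then placed in fixed windows for any later aggregation. The function name build_pipeline and the step labels are hypothetical, and the Beam imports are deferred into the function only so that the two helper functions can be exercised without Beam installed. The source does not confirm whether this arrangement changes the behaviour on the local runner, so treat it as an illustration of the answer rather than a verified fix.

```python
import json


def load_side_input(pubsub_message):
    """Decode a Pub/Sub payload into a ('side', dict) pair usable with AsDict."""
    return ("side", json.loads(pubsub_message.decode("utf8")))


def comb(x, y):
    """Combine a main-input element with the side-input dict."""
    return f"Main input: {x}, side input: {y}"


def build_pipeline(pipeline, input_subscription, side_input_sub):
    """Hypothetical helper: join first, re-window afterwards."""
    # Assumption: apache_beam is installed; imported here so the helpers
    # above stay importable without it.
    import apache_beam as beam
    from apache_beam.transforms import trigger, window

    # Side input stays in the global window and fires on every element.
    side_input = (
        pipeline
        | "Read side" >> beam.io.ReadFromPubSub(subscription=side_input_sub)
        | "Window side" >> beam.WindowInto(
            window.GlobalWindows(),
            trigger=trigger.Repeatedly(trigger.AfterCount(1)),
            accumulation_mode=trigger.AccumulationMode.DISCARDING)
        | "Parse side" >> beam.Map(load_side_input)
    )

    # Main input: no WindowInto before the join, so it is still in the
    # default global window when the side input is looked up.
    joined = (
        pipeline
        | "Read main" >> beam.io.ReadFromPubSub(
            subscription=input_subscription, with_attributes=True)
        | "Add side input" >> beam.Map(comb, y=beam.pvalue.AsDict(side_input))
    )

    # Re-window only after the join, if an aggregation follows.
    return joined | "Rewindow" >> beam.WindowInto(window.FixedWindows(10))
```

If no aggregation follows the join, the final WindowInto step can be dropped entirely, which is the simpler of the two options the answer describes.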