Dataflow job does not produce any output
I have an issue where the Dataflow job actually runs fine, but it does not produce any output until the job is manually drained.
With the following code I was assuming that it would produce windowed output, effectively triggering after each window.
lines = (
    p
    | "read" >> source
    | "decode" >> beam.Map(decode_message)
    | "Parse" >> beam.Map(parse_json)
    | beam.WindowInto(
        beam.window.FixedWindows(5 * 60),
        trigger=beam.trigger.Repeatedly(
            beam.trigger.AfterProcessingTime(5 * 60)),
        accumulation_mode=beam.trigger.AccumulationMode.DISCARDING)
    | "write" >> sink
)
What I want is that if it has received events in a window, it should produce output after the window in any case. The source is Cloud Pub/Sub, with approximately 100 events/minute.
These are the parameters I use to start the job:
python main.py \
--region $(REGION) \
--service_account_email $(SERVICE_ACCOUNT_EMAIL_TEST) \
--staging_location gs://$(BUCKET_NAME_TEST)/beam_stage/ \
--project $(TEST_PROJECT_ID) \
--inputTopic $(TOPIC_TEST) \
--outputLocation gs://$(BUCKET_NAME_TEST)/beam_output/ \
--streaming \
--runner DataflowRunner \
--temp_location gs://$(BUCKET_NAME_TEST)/beam_temp/ \
--experiments=allow_non_updatable_job \
--disk_size_gb=200 \
--machine_type=n1-standard-2 \
--job_name $(DATAFLOW_JOB_NAME)
Any ideas on how to fix this? I'm using the apache-beam 2.22 SDK with Python 3.7.
Excuse me if you do mean 2.22, because "apache-beam 1.22" seems quite old. Especially since you are using Python 3.7, you might want to try a newer SDK version such as 2.22.0.
What I want is that if it has received events in a window, it should produce output after the window in any case. The source is a Cloud PubSub, with approximately 100 events/minute.
If you just need one pane fired per window, with fixed windows every 5 minutes, you can simply go with
beam.WindowInto(beam.window.FixedWindows(5*60))
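For intuition about what `FixedWindows(5*60)` does: each event timestamp is bucketed by integer division by the window size. A minimal pure-Python sketch of that bucketing (an illustration only, not Beam's actual implementation):

```python
def fixed_window_bounds(timestamp_s, size_s=5 * 60):
    """Return the [start, end) bounds of the fixed window holding timestamp_s."""
    start = int(timestamp_s // size_s) * size_s
    return start, start + size_s

# An event at t=730s with 300s windows falls in the window [600, 900).
```

With the default (watermark) trigger, each such window emits exactly one pane once the watermark passes its end, and windows that received no events emit nothing.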
If you want to customize triggers, you can take a look at the streaming-102 document.
Here is a streaming example with visualization of windowed outputs.
from datetime import timedelta

import apache_beam as beam
from apache_beam.runners.interactive import interactive_beam as ib
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

# Capture 30 seconds of streaming data for interactive inspection.
ib.options.capture_duration = timedelta(seconds=30)
ib.evict_captured_data()

pstreaming = beam.Pipeline(InteractiveRunner(), options=options)

words = (pstreaming
         | 'Read' >> beam.io.ReadFromPubSub(topic=topic)
         | 'Window' >> beam.WindowInto(beam.window.FixedWindows(5)))

ib.show(words, visualize_data=True, include_window_info=True)
If you run this code in a notebook environment such as JupyterLab, you get to debug streaming pipelines with visualized windowed outputs. Note that the windows are visualized: for a capture period of 30 seconds we get 6 windows, since the fixed window size is set to 5 seconds. You can bin data by window to see which data came in which window.
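The "bin data by window" step amounts to grouping each element by the window it was assigned to. A hypothetical pure-Python sketch of that grouping, assuming `(timestamp, value)` pairs like the window info that `include_window_info=True` surfaces:

```python
from collections import defaultdict


def bin_by_window(timestamped_values, size_s=5):
    """Group (timestamp_s, value) pairs by their fixed-window start."""
    bins = defaultdict(list)
    for ts, value in timestamped_values:
        # Same bucketing rule as FixedWindows: floor to a multiple of size_s.
        bins[int(ts // size_s) * size_s].append(value)
    return dict(bins)
```

For example, with 5-second windows, events at t=0s and t=3s land in the `0` bin while an event at t=7s lands in the `5` bin.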
You can set up your own notebook runtime by following the instructions, or you can use the hosted solution provided by Google Dataflow Notebooks.